treelabel/README.Rmd at main · const-ae/treelabel · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
options(width = 100)
```

# treelabel

<!-- badges: start -->
<!-- badges: end -->

The goal of treelabel is to store and work with labels that exist in a hierarchical relationship.

This is an alpha software release: feel free to play around with the code, and please provide feedback, but expect breaking changes to the API!

## Motivation

![](man/figures/celltype_tree.png)

I work on single-cell RNA-seq data with gene expression profiles for thousands of cells. A common first step is to annotate each cell's *cell type*. The granularity of these cell type annotations can vary; one can classify cells broadly into *immune cells* or *epithelial cells* or one can be very detailed and distinguish within the immune cells *CD4 positive T regulatory cells* from *CD4 positive T follicular helper cells*. Choosing the best annotation level is difficult because one analysis may need broad cell types, whereas others require the highest possible resolution. The `treelabel` package provides an intuitive interface to store and work with these hierarchically related labels.

Depending on the reference data and annotation method used for the cell typing, you often have multiple (partially) conflicting annotations. `treelabel` provides functions to build a consensus across different annotations and can integrate annotations at different resolutions. Furthermore, `treelabel` supports uncertainty scores associated with a label. For example, most automatic cell type scoring tools (like [Azimuth](https://azimuth.hubmapconsortium.org/) or [celltypist](https://www.celltypist.org/)), return a confidence score in addition to the cell type label. These scores enable a more precise selection of cells where you have sufficient confidence in the cell type label.

## What this package is. And what it is not,

This package is purposefully kept generic and only makes the following assumptions:

- Your labels have a tree-like relationship: the edges between the labels are directed, and there are no cycles.
- The relation between a parent and a child can phrased as *is a*. For example, a *`T cell` is a `Immune cell`*.
- The scores can be logical or non-negative numbers.

This package does not provide any functionality to:

- Assign cell types to cells based on the expression profile. Use any of the many available automatic cell type scoring tools (see for [this](https://github.com/seandavi/awesome-single-cell?tab=readme-ov-file#cell-type-identification-and-classification) list for some suggestions) or do it manually using clustering plus marker gene expression.
- Automatically harmonize cell type labels from different references (e.g., figure out that the `NK cells` from one dataset correspond to the `Natural killer cells` from another). You have to do this manually. There is an example further down in the README.
- Provide the optimal cell type tree. You will probably want to define the tree for your analysis depending on the annotations available to you. As a reference, look at the [cell ontology project](https://cell-ontology.github.io/), which provides a large database of cell type label relationships and is used by the Human Cell Atlas.
- Plot trees. For demonstration purposes, we will use the `igraph` plots (which are not very pretty), and for the plot on the top, I used the [D3](https://d3js.org/) library from Javascript (which is cumbersome to use from R). See the end of the README for an example how to make pretty plots of trees with ggplot2.

## Installation

You can install the development version of `treelabel` like this:

``` r
devtools::install_github("const-ae/treelabel")
```

# Documentation

`treelabel` is build to be directly compatible with the `tidyverse`.

```{r, eval=FALSE}
library(treelabel)
library(tidyverse)
```
```{r, include=FALSE}
devtools::load_all(".")
library(tidyverse)
options(
  pillar.print_max = 5,
  pillar.print_min = 5
)
```

## Motivating example

I will demonstrate a typical single cell analysis workflow that takes a hierarchical set of labels, stores them in a `treelabel` vector, and analyzes the abundance changes of cell types.

I will illustrate the process using the "pbmcsca" dataset from Seurat and score each cell using Azimuth. Note that `treelabel` is compatible with any data storage format (e.g., `SingleCellExperiment` or `Seurat`) and can handle both manual cell type labels based on clustering or automated cell scores produced by, for example, Azimuth or Celltypist.

```{r message=FALSE, warning=FALSE, paged.print=FALSE, results='hide'}
# This can take a minute to run through
library(Seurat)
# Might need to call `SeuratData::InstallData("pbmcsca")` first
pbmcsca <- SeuratData::LoadData("pbmcsca")
azimuth_res <- Azimuth::RunAzimuth(pbmcsca, reference = "pbmcref")
# Select most important columns to make the output easier to read.
azimuth_res@meta.data <- select(azimuth_res@meta.data, c("orig.ident", "Experiment", "Method", starts_with("predicted.celltype.")))
```

Take a look at the Azimuth output. It contains six new columns that all start with "predicted.celltype". These columns contain the labels and confidence scores from the automated mapping. These will serve as the input to `treelabel` which turns the six columns into one!

```{r, paged.print=FALSE}
azimuth_res@meta.data |>
  as_tibble(rownames = "cell_id")
```

To create the `treelabel` vector, we need to define our cell type hierarchy. We use the `igraph` packages for this.

```{r, paged.print=FALSE}
# Define the cell type label hierarchy
pbmcsca_tree <- igraph::graph_from_literal(
  root - NK : `T cell` : Mono : DC : B,
  `T cell` - `CD8 T` : `CD4 T`,
  `CD4 T` - CTL : `CD4 Naive` : `CD4 TCM` : `CD4 TEM` : Treg,
  `CD8 T` - `CD8 Naive` : `CD8 TCM` : `CD8 TEM`,
  DC - cDC1 : cDC2 : pDC,
  Mono - `CD14 Mono` : `CD16 Mono`
)

# Convert the Azimuth result into a treelabel vector:
# * We first append '.label' to the Azimuth column to make the `pivot_longer` simpler.
# * We then filter the cell types to the ones we list in the `pbmcsca_tree`.
# * Lastly, we convert the label and score column into a treelabel vector.
celltype_annotations <- azimuth_res@meta.data |>
  as_tibble(rownames = "cell_id") |>
  dplyr::rename_with(.cols = matches("^predicted.celltype.l\\d$"), \(x) paste0(x, ".label")) |>
  pivot_longer(starts_with("predicted.celltype"), names_prefix = "predicted\\.celltype\\.", names_sep = "\\." ,names_to = c("level", ".value")) |>
  filter(label %in% igraph::V(pbmcsca_tree)$name) |>
  treelabel_from_dataframe(pbmcsca_tree, id = "cell_id", label = "label", score = "score", name = "azimuth_celltypes")
```

The `azimuth_celltypes` column is an S3 vector (build ontop of the `vctrs` package) which works a bit like a `factor`. Each entry contains the full information about the hierarchical cell type labels. By default, `treelabel` prints the most precise cell type label that is available, but the information about the other levels is still accessible

```{r, paged.print=FALSE}
vec <- head(celltype_annotations$azimuth_celltypes, n = 8)
vec

# The fourth element is a T cell, a CD8 T, and a CD8 TEM!
vec[4]
tl_eval(vec[4], `T cell`)
tl_eval(vec[4], `CD8 T`)
tl_eval(vec[4], `CD8 TEM`)
```

We can join the `celltype_annotations` with the full meta data to replace the old `predicted.celltype` column with our new `treelabel`.

```{r, paged.print=FALSE}
meta_data <- azimuth_res@meta.data |>
  as_tibble(rownames = "cell_id") |>
  dplyr::select(- starts_with("predicted.celltype")) |>
  left_join(celltype_annotations, by = "cell_id")

meta_data
```

The `pbmcsca` contains results from ten different sequencing experiments. We can for example count how often each cell type was seen per method

```{r, paged.print=FALSE}
# Summing the confidence scores does not exactly give you the counts
meta_data |>
  summarize(as_tibble(tl_score_matrix(sum(azimuth_celltypes, na.rm=TRUE))), .by = c(Method))

# Instead, you can say that only labels where the score exceeds a thresholds count.
meta_data |>
  mutate(azimuth_celltypes = tl_modify(azimuth_celltypes, .scores > 0.8)) |>
  summarize(as_tibble(tl_score_matrix(sum(azimuth_celltypes, na.rm=TRUE))), .by = c(Method))
```

The `test_abundance_changes` function makes it easy to test if the number of cells of a cell type changes between conditions. Importantly, you need to have multiple independent replicates. The `pbmcsca` unfortunately does not have that, so I will just simulate random patient IDs to demonstrate how the function works.

```{r, paged.print=FALSE}
# Apply threshold and make Experiment a factor
input_dat <- meta_data |>
  mutate(Experiment = as.factor(Experiment)) |>
  mutate(patient_id = sample(paste0("sample_", 1:5), size = n(), replace = TRUE)) |>
  mutate(azimuth_celltypes = tl_modify(azimuth_celltypes, .scores > 0.8))

# The function takes many arguments. See `?test_abundance_changes` for all details
test_abundance_changes(input_dat, design = ~ Experiment, aggregate_by = patient_id)

# We can also run `test_abundance_changes` inside dplyr::reframe (an alternative to `summarize`)
# and calculate the abundance changes for each Method separately.
# Setting `reference = `T cell` will test if the number of T cell subtypes changes as a proportion
# of all T cells.
input_dat |>
  reframe(test_abundance_changes(data = across(everything()), design = ~ Experiment, aggregate_by = patient_id,
                                 reference = `T cell`, contrast = cond(Experiment = 'pbmc2') - cond(Experiment = 'pbmc1')),
          .by = Method)
```

### Compatibility with Bioconductor and Seurat

`treelabel` works directly with the BioConductor data structures `SingleCellExperiment`, `SummarizedExperiment`, and `DataFrame`.

```{r}
# Load an example SingleCellExperiment object
suppressMessages({
  sce <- ExperimentHub::ExperimentHub()[["EH2259"]]
})
# Make a simple tree with only one level
kang_tree <- igraph::graph_from_edgelist(cbind("root", levels(sce$cell)))
# Add treelabel column to colData
colData(sce)$treelabel <- treelabel(sce$cell, kang_tree)
colData(sce)
```

The `treelabel` vectors can also be used with Seurat data. Here, we match the provided annotations to the names from the `pbmcsca_tree`.


```{r eval=FALSE, include=TRUE}
# Load pbmcsca again
library(Seurat)
pbmcsca <- SeuratData::LoadData("pbmcsca")
```

```{r}
# We will re-use the `pbmcsca_tree` from above. The provided annotations in pbmcsca$CellType
# are in a slightly different format, so we manually convert them.
rename_pbmcsca_celltypes <- c(
  "B cell" = "B", "CD14+ monocyte" = "CD14 Mono", "CD16+ monocyte" = "CD16 Mono",
  "CD4+ T cell" = "CD4 T", "Cytotoxic T cell" = "CD8 T",  "Dendritic cell" = "DC",
  "Megakaryocyte" = "Mono", "Natural killer cell" = "NK",
  "Plasmacytoid dendritic cell" = "pDC", "Unassigned" = "root"
)

pbmcsca@meta.data$tl_manual <- treelabel(rename_pbmcsca_celltypes[pbmcsca$CellType], pbmcsca_tree)
pbmcsca@meta.data[1:5,c("orig.ident", "CellType", "tl_manual")]
```


## Technical documentation

We define our label hierarchy using [`igraph`](https://r.igraph.org/articles/igraph.html).

```{r tree_plot}
tree <- igraph::graph_from_literal(
  root - ImmuneCell : EndothelialCell : EpithelialCell,
  ImmuneCell - TCell : BCell,
  TCell - CD4_TCell : CD8_TCell
)
plot(tree, layout = igraph::layout_as_tree(tree, root = "root"),
     vertex.size = 40, vertex.label.cex = 0.6)
```

### Constructors

The easiest way to make a `treelabel` vector is to make one from a character vector. You call the `treelabel` constructor and provide the labels and the reference tree

```{r}
char_vec <- c("BCell", "EndothelialCell", "CD4_TCell", NA, "BCell", "EpithelialCell", "ImmuneCell")
vec <- treelabel(char_vec, tree = tree)
vec
```


If you have some uncertainty associated with each label, you can also use a named `numeric` vector to make a `treelabel` vector.

```{r}
num_vec <- c("BCell" = 0.99, "EndothelialCell" = 0.6, "CD4_TCell" = 0.8, NA, "BCell" = 0.78, "EpithelialCell" = 0.9, "ImmuneCell" = 0.4)
vec <- treelabel(num_vec, tree = tree)
vec
```


Some tools provide the confidence scores for each vertex in the tree. In this case, you can provide the annotations as a `list ` or a `data.frame`

```{r}
lst <- list(
  c(BCell = 0.99, ImmuneCell = 1),
  c(root = 1, EndothelialCell = 0.65),
  c(CD4_TCell = 0.8, TCell = 0.95, ImmuneCell = 0.95),
  NULL, # will be treated as NA
  c(ImmuneCell = 0.4)
)

vec <- treelabel(lst, tree)
vec
```

Lastly, you can convert a "tidy" data frame to a treelabel. The `treelabel_from_dataframe` works differently from the other constructors, as it returns a `data.frame` with an ID column and a `treelabel` column. The function cannot directly return a `treelabel` vector because the order of the rows in the data.frame could be scrambled, in which case it is unclear how cells and elements in the treelabel relate.

```{r}
df <- data.frame(
  cell_id = c("cell 1", "cell 1", "cell 2", "cell 3", "cell 3", "cell 3"),
  annot = c("BCell", "ImmuneCell", NA, "TCell", "CD4_TCell", "ImmuneCell"),
  confidence = c(0.99, 1, NA, 0.95, 0.8, 0.95)
)
df <- treelabel_from_dataframe(df, tree, id = "cell_id", label = "annot", score = "confidence")
df
```


### Working with the `treelabel` vector

The `treelabel` vectors can be indexed or concatenated like any regular R vector:

```{r}
vec
length(vec)
vec[2]
vec[1:4]
c(vec, vec[1:3])
```

You can extract the tree from a `treelabel` and the name of the tree root.

```{r}
tl_tree(vec)
tl_tree_root(vec)
```

The easiest way to get the score for a particular label inside a `treelabel` vector is to use `$`

```{r}
vec$ImmuneCell
vec$CD4_TCell
```

### Testing the identity

The printing function builds on the `tl_name`, which returns the vertex furthest from the root that is not `NA`. We can change this threshold. For example, for the third cell the *CD4_TCell* label does not pass the `0.9` threshold, but the *TCell* label does.

```{r, paged.print=FALSE}
tibble(vec, tl_name(vec), tl_name(vec, threshold = 0.9))
```

You can also evaluate arbitrary expressions using `tl_eval`.

```{r, paged.print=FALSE}
tibble(vec) |> mutate(is_tcell = tl_eval(vec, TCell > 0.9))
```

`treelabel` is clever about evaluating these expressions. If, for example, we ask if the cell might be a T cell (i.e., `TCell > 0.2`), the second and fifth entries switch from `FALSE` to `NA`.

```{r, paged.print=FALSE}
tibble(vec) |> mutate(maybe_tcell = tl_eval(vec, TCell > 0.2))
```

To understand why, let's look at how `treelabel` internally stores the data.  Internally, the scores are stored as a matrix with one column for each label, and the scores that were not specified are stored as `NA`.

```{r}
tl_score_matrix(vec)
```

For each missing element, we can give a lower and upper bound for the value. For the fifth element the confidence that it is an `ImmuneCell` is `tl_get(vec[5], "ImmuneCell")` = `r tl_get(vec[5], "ImmuneCell")`. This means that each child can also be at most `0.4`.

The general formula is that the score for a vertex `v` that is `NA` can be at most (in pseudocode): `max(0, score(parent(v)) - sum(children(parent(v)), na.rm=TRUE))`.

```{r}
# tl_atmost is clever
tl_atmost(vec) |> tl_score_matrix()
# tl_atleast simply replaces `NA`'s with zeros
tl_atleast(vec) |> tl_score_matrix()
```

The `tl_eval` function evaluates its arguments for `tl_atmost(x)` and `tl_atleast(x)`. If the results agree, that value is returned; if not, `tl_eval` returns `NA`. A word of caution: this function can give surprising results if multiple label references occur in the expression.

```{r}
t1 <- treelabel(list(c("TCell" = 0.8)), tree)
# Ideally both function calls would return `NA`
tl_eval(t1, CD4_TCell > CD8_TCell)
tl_eval(t1, CD4_TCell < CD8_TCell)
```


### Arithmetic

You can combine two vectors or summarize across elements. You can do whatever calculations you want, and `treelabel` will try to make the right thing happen. You can also do problematic things like produce negative values. `treelabel` currently does not stop you, but this breaks one of the assumptions of `treelabel`.

```{r, paged.print=FALSE}
vec2 <- treelabel(c("BCell" = 0.8, "EpithelialCell" = 0.3, "TCell" = 0.9, "CD8_TCell" = 0.2, "TCell" = 0.8), tree)
tibble(vec, vec2) |>
  mutate(arithmetic_mean = (vec + vec2) / 2,
         geometric_mean = (vec * vec2)^(1/2),
         rounding = round(vec))
```

### Modification

You can modify the elements of a `treelabel` vector. The easiest is way is to use `if_else` (note `ifelse` does not work!!) and mix the content of two vectors. Alternatively, you can set elements to `NA`.

```{r}
high_quality_res <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
# Combine two vectors or set one to 'NA'
if_else(high_quality_res, vec, vec2)
if_else(high_quality_res, vec, NA)
```

If you want to modify the content within a tree, that is change the value of individual vertices, you can use the `tl_modify` function.

```{r}
# The effect of tl_modify is best understood by considering the underlying score matrix
tl_score_matrix(vec)[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = 0.3))[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = ImmuneCell / 3))[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = root - ImmuneCell/3))[,1:3]
tl_score_matrix(tl_modify(vec, ImmuneCell = NA, .propagate_NAs_down = TRUE))
tl_score_matrix(tl_modify(vec, ImmuneCell = NA, .propagate_NAs_down = FALSE))
```

#### Tree modifications

Sometimes you don't want to change the values within the tree, but change the tree structure or only work on a selected branch. The `tl_tree_modify` allows you to set a completely new tree structure and only retain values for vertices that occurr in both the new and old tree.

```{r}
subtree <- igraph::graph_from_literal(
  root - CD4_TCell : CD8_TCell : EndothelialCell : EpithelialCell
)

tl_score_matrix(vec)
tl_tree_modify(vec, subtree) |> tl_score_matrix()
```

Sometimes, you only want to work on a single branch of the tree. You can do this using the `tl_tree_filter` and `tl_tree_cut` functions.

```{r}
# Select all T cells
tl_tree_cut(vec, new_root = "TCell")

# This does the same, but leaves the old root
tl_tree_filter(vec, \(names) grepl("TCell", names))
```


### Consensus construction

`treelabel` provides functions to make it easy to apply expression across `treelabel` columns. These functions are built on top of [`dplyr::across`](https://dplyr.tidyverse.org/reference/across.html). They take as the first argument a specification of columns (e.g., `where(is_treelabel)` or `starts_with("label_")`). The second argument is evaluated internally with `tl_eval`.

```{r, paged.print=FALSE}
dat <- tibble(cell_id = paste0("cell_", 1:5), vec, vec2)

dat |> mutate(is_immune = tl_across(where(is_treelabel), ImmuneCell > 0.7))
dat |> mutate(immune_counts = tl_sum_across(where(is_treelabel), ImmuneCell > 0.7))
dat |> mutate(mean_immune_score = tl_mean_across(where(is_treelabel), ImmuneCell))
dat |> filter(tl_if_all(where(is_treelabel), ImmuneCell > 0.7))
```

In addition, we can also work on the whole matrix of values per tree label and combine them.

```{r, paged.print=FALSE}
dat |>  mutate(consensus = tl_mean_across(c(vec,vec2)))

dat |>
  mutate(across(c(vec, vec2), \(x) tl_modify(x, .scores > 0.5))) |>
  mutate(consensus = tl_sum_across(c(vec,vec2)))
```


### Pretty plotting

The following visualization is inspired by the default tree visualization in D3.

```{r}
#' Calculate layout of tree using igraph and return results as two tibbles.
prepare_tree_for_plotting <- function(tree, tree_root = "root"){
  tree <- .make_tree(tree, root = tree_root)

  layout <- igraph::layout_as_tree(tree, root = tree_root)

  children <- lapply(igraph::V(tree), \(v){
    igraph::neighbors(tree, v, mode = "out")$name
  })

  vertices <- igraph::V(tree)$name
  nodes <- tibble(node = vertices,
         distance_to_root = max(layout[,2]) - layout[,2],
         position = layout[,1],
         is_leaf = vapply(children, \(x) length(x) == 0, FUN.VALUE = logical(1L)))

  edges <- edges <- tibble(node = vertices,
       child = unname(children)) |>
    unnest(child) |>
    left_join(nodes, by = c("node" = "node")) |>
    left_join(nodes, by = c("child" = "node"), suffix = c(".node", ".child"))

  list(nodes = nodes, edges = edges)
}
```

Make the plot. The [ggbezier](https://github.com/const-ae/ggbezier) is not on CRAN yet.

```{r ggplot_code, fig.height=2}
pl_tree <- prepare_tree_for_plotting(tree)

ggplot(data = pl_tree$nodes, aes(x = distance_to_root, y = position)) +
  ggbezier::geom_bezier(data = pl_tree$edges |>  pivot_longer(c(ends_with(".node"), ends_with(".child")), names_sep = "\\.", names_to = c(".value", "side")),
                        aes(x = distance_to_root, y = position, x_handle1 = distance_to_root - 0.4,
                            x_handle2 = distance_to_root + 0.4, y_handle1 = position, y_handle2 = position, group = paste0(node, "-", child)),
                        show_handles = FALSE, color = "lightgrey", linewidth = 0.3) +
  geom_point(aes(color = I(ifelse(is_leaf, "lightgrey", "#4e4e4e")))) +
  shadowtext::geom_shadowtext(aes(label = node, hjust = ifelse(is_leaf, 0, 1), x = distance_to_root + ifelse(is_leaf, 0.03, -0.03)),
                              color = "black", bg.colour = "white") +
  scale_x_continuous(expand = expansion(add = c(0.5, 0.9))) +
  theme_void()
```


## Session Info

```{r}
sessionInfo()
```