MapMyCells enables mapping of single cell and spatial trancriptomics data sets to a human middle temporal gyrus (MTG) taxonomy. The taxonomy is derived and presented in “Integrated multimodal cell atlas of Alzheimer’s disease” (https://doi.org/10.1101/2023.05.08.539485), and we encourage you to cite this work if you use MapMyCells to transfer these labels to your data. It is also used throughout the web tools on SEA-AD.org. This R workbook illustrates a common use case for the MapMyCells facility (https://knowledge.brain-map.org/mapmycells/process/) and follow up analyses. The query data included in this document are from the paper “Conserved cell types with divergent features in human versus mouse cortex ” (https://doi.org/10.1038/s41586-019-1506-7) from the Allen Institute, which also has a nice data exploration tool (RNA-Seq Data Navigator; https://celltypes.brain-map.org/rnaseq/human/mtg).
This example is run and R, which can be downloaded at CRAN (https://cran.r-project.org/). To run in R, just sequentially copy and paste the relevant code blocks into R. An easier alternative is to download RStudio here (https://posit.co/download/rstudio-desktop/) after downloading R. “.Rmd” files can be directly loaded into R studio and run.
As an example data set, we will use the a subset of the snRNA-seq dataset published in Hodge, et al. (2019), which has been conveniently packaged into an R library. (To see how one would perform this step from a 10X experiment, see the version of this tutorial for whole mouse brain.)
if(!is.element("hodge2019data",.packages(all.available = TRUE))) # Install data package if not already installed
devtools::install_github("AllenInstitute/hodge2019data")
The next component of set up is to make sure your R working directory points to the data you are reading in.
# Uncomment line below and replace bracketed text with path to downloaded files.
#setwd("FILE_PATH")
This workbook uses the libraries anndata, Seurat, and hodge2019data.
# Install Seurat and anndata if needed
list.of.packages <- c("Seurat", "anndata")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)>0) install.packages(new.packages)
suppressPackageStartupMessages({
library(Seurat) # For reading droplet data sets and visualizing/comparing results
library(anndata) # For writing h5ad files
library(hodge2019data) # For the query data set
})
## Warning: package 'Seurat' was built under R version 4.2.3
## Warning: package 'anndata' was built under R version 4.2.3
options(stringsAsFactors=FALSE)
# Citing R libraries
# citation("Seurat") # Note that the citation for any R library can be pulled up using the citation command. We encourage citation of R libraries as appropriate.
With the above libraries loaded, the use case below can now be run.
(If you already have an .h5ad file ready for upload, skip to step 4.)
A common use case for analysis of single cell/nucleus transcriptomics is to collect data from several thousand cells, either though one (or more) ports of a droplet-based scRNA-seq run or using single well per cell SMART-Seq sequencing. After some QC steps, these are then clustered for defining cell types. This section describes how to start from such a data set, transfer labels from human “Human-MTG-10x_SEA-AD” data from the Allen Institute / SEA-AD onto these cells using MapMyCells, and then visualize the results in a UMAP.
With the data library loaded, these are already read in. We just rename for convenience.
dataIn <- data_Hodge2019 # These are the counts
annoIn <- metadata_Hodge2019 # These are the metadata (e.g., cluster assignments)
dim(dataIn)
## [1] 36591 3357
This reads in a data matrix with 3,357 nuclei.
The above data has already been carefully QC’ed, and therefore a QC step is unnecessary. However, in a typical experiment a QC step would be needed at this point. For demonstrative purposes, we will define all barcodes with >250 reads as interesting “cells”, but in a real experiment more careful QC is strongly encouraged.
dataQC <- dataIn[,colSums(dataIn)>250]
dim(dataQC)
## [1] 36591 3357
This step retains all cells for this example, but in a droplet-based experiment it would typically bring the input library down from a huge number to a more reasonable number.
Now let’s output the QC’ed data matrix into an h5ad file and output it to the current directory for upload to MapMyCells. Note that in anndata data structure the genes are saved as columns rather than genes so we need to transpose the matrix first.
# Transpose data
dataQCt = Matrix::t(dataQC)
# Convert to anndata format
ad <- AnnData(
X = dataQCt,
obs = data.frame(group = rownames(dataQCt), row.names = rownames(dataQCt)),
var = data.frame(type = colnames(dataQCt), row.names = colnames(dataQCt))
)
# Write to compressed h5ad file
write_h5ad(ad,'Hodge2019.h5ad',compression='gzip')
# Check file size. File MUST be <500MB to upload for MapMyCells
print(paste("Size in MB:",round(file.size("Hodge2019.h5ad")/2^20)))
## [1] "Size in MB: 73"
If you have trouble accessing or running the previous block, check your Python accessibility in R and consider installing python as shown below. If you use Windows, you may be asked to install git before installing python, which can be accessed here: https://git-scm.com/download/win/.
library(reticulate)
version <- "3.9.12"
install_python(version)
virtualenv_create("my-environment", version = version)
use_virtualenv("my-environment")
These next steps are performed OUTSIDE of R in the MapMyCells web application.
The steps to MapMyCells are as follows:
Let’s now look at the output results from the hierarchical clustering algorithm. We can read this into R using read.csv, but note that the first few lines contain metadata that need to be skipped, and begin with a ‘#’ character.
mapping <- read.csv("Hodge2019_mapping.csv",comment.char="#")
head(data.frame(mapping))
|
cell_id |
class_label |
class_name |
class_bootstrapping_probability |
|
---|---|---|---|---|---|
1 | LS-15093_S11_E1-50 | CS20230505_CLAS_0001 | Neuronal: GABAergic | 1 | |
2 | LS-15093_S13_E1-50 | CS20230505_CLAS_0002 | Neuronal: Glutamatergic | 1 | |
3 | LS-15093_S21_E1-50 | CS20230505_CLAS_0002 | Neuronal: Glutamatergic | 1 | |
4 | LS-15093_S26_E1-50 | CS20230505_CLAS_0002 | Neuronal: Glutamatergic | 1 | |
5 | LS-15093_S37_E1-50 | CS20230505_CLAS_0002 | Neuronal: Glutamatergic | 1 | |
6 | LS-15093_S39_E1-50 | CS20230505_CLAS_0002 | Neuronal: Glutamatergic | 1 |
6 rows | 1-5 of 11 columns [This notebook preview is limited to the first 5 columns. Run the notebook in your own environment to see all columns.]
In this taxonomy, MapMyCells maps input cells to the taxonomy at three increasing levels of resolution from coarsest class, to intermediate subclass, and finest supertype. In this human SEA-AD taxonomy there are 3 classes, 24 subclasses, and 139 supertypes.
The file consists of the following columns:
As this data set is previously published, we already know the expected subclasses for each cell. Let’s check how well the mapped and clustered subclass calls match. Recall that we defined metadata as “annoIn” above. (This step will only work if your query data has previously assigned cell types, and may need slight modifications.)
comparison = table(mapping$subclass_name,annoIn$subclass)
heatmap(comparison, ylab = "Mapping", xlab = "Clustering",margins = c(10,10))
There is extremely good agreement, with most of the boxes showing all (red) or none (yellow) of the cells mapped to the same SEA-AD subclass coming from a given Hodge subclass call. (It turns out that the disagreements are largely due to differences in how subclasses are named or how their borders are defined between 2019 and 2023.)
Using the same strategy, we can see how clusters from Hodge et al 2019 correspond to supertypes from SEA-AD.
comparison = table(mapping$supertype_name,annoIn$cluster_label)
heatmap(comparison, ylab = "Mapping", xlab = "Clustering",margins = c(8,8), cexRow = 0.5, cexCol = 0.5)
To do this properly would require more complicated analysis, but the main take-home here is that there is strong correspondence between initial cluster calls and mapping supertype assignments (lots of red and yellow, minimal intermediate colors).
To visualize the mapping results, we need both the mapping results and the original query cellxgene matrix for comparison. If this is not already read in, you can read it in from the anndata object uploaded to MapMyCells.
# Since the query data corresponds to dataQC above, we will call it dataQC again
dataQC_h5ad <- read_h5ad('Hodge2019.h5ad')
dataQC <- t(as.matrix(dataQC_h5ad$X))
rownames(dataQC) <- rownames(dataQC_h5ad$var)
colnames(dataQC) <- rownames(dataQC_h5ad$obs)
Now let’s visualize the mapping results. We will do this by saving the data in a Seurat object with mapping results as metadata, running the standard pipeline for creating a UMAP in Seurat, and then color-coding each cell.
# Create the Seurat object
dataSeurat <- CreateSeuratObject(counts = dataQC, meta.data = mapping)
# Standard Seurat pipeline
dataSeurat <- NormalizeData(dataSeurat, verbose = FALSE)
dataSeurat <- FindVariableFeatures(dataSeurat, verbose = FALSE)
dataSeurat <- ScaleData(dataSeurat, verbose = FALSE)
dataSeurat <- RunPCA(dataSeurat, verbose = FALSE)
dataSeurat <- RunUMAP(dataSeurat, dims = 1:10, verbose = FALSE)
Now let’s make the plot for subclasses!
DimPlot(dataSeurat, reduction = "umap", group.by="subclass_name", label=TRUE) + NoLegend()
Here the data has not been clustered but rather the class assignments are assigned to the data in the UMAP. You can see that cells assigned different mapped types localize in different areas of the plot, suggesting that the label transfer works well. It’s worth noting that Seurat produces different UMAP configurations in different R environments, so your plot may not look exactly like this.
Now let’s repeat this analysis where we color-code UMAPs by supertype after separating cells mapped to different classes.
classes <- unique(mapping$class_name)
p <- list()
for (cl in classes){
datClass <- dataSeurat[,mapping$class_name==cl]
datClass <- NormalizeData(datClass, verbose = FALSE)
datClass <- FindVariableFeatures(datClass, verbose = FALSE)
datClass <- ScaleData(datClass, verbose = FALSE)
datClass <- RunPCA(datClass, verbose = FALSE)
datClass <- RunUMAP(datClass, dims = 1:10, verbose = FALSE)
p[[cl]] <- DimPlot(datClass, reduction = "umap", group.by="supertype_name", label=TRUE) + NoLegend()
}
p[["Neuronal: GABAergic"]]
p[["Neuronal: Glutamatergic"]]
p[["Non-neuronal and Non-neural"]]
Once again, the supertypes shown largely segregate from one another without the need to apply clustering, suggesting this mapping works well even at higher resolutions. Scripts such as this can be used for visualizing and comparing mapping results. It’s worth noting that given how different these subclasses are from one another, viewing the data separately for each subclass or more carefully choosing variable genes would likely result in even better visualizations.
To output the session information we write the command, which is useful for reproducibility, especially for more complex scripts.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] hodge2019data_1.0 anndata_0.7.5.6 SeuratObject_5.0.0 Seurat_4.4.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.16 colorspace_2.1-0 deldir_1.0-9
## [4] ellipsis_0.3.2 ggridges_0.5.4 rstudioapi_0.15.0
## [7] spatstat.data_3.0-3 farver_2.1.1 leiden_0.4.3
## [10] listenv_0.9.0 ggrepel_0.9.4 fansi_1.0.5
## [13] codetools_0.2-19 splines_4.2.2 cachem_1.0.8
## [16] knitr_1.45 polyclip_1.10-6 spam_2.10-0
## [19] jsonlite_1.8.7 ica_1.0-3 cluster_2.1.4
## [22] png_0.1-8 uwot_0.1.16 shiny_1.7.5.1
## [25] sctransform_0.4.1 spatstat.sparse_3.0-3 compiler_4.2.2
## [28] httr_1.4.7 assertthat_0.2.1 Matrix_1.6-1.1
## [31] fastmap_1.1.1 lazyeval_0.2.2 cli_3.6.1
## [34] later_1.3.1 htmltools_0.5.6.1 tools_4.2.2
## [37] igraph_1.5.1 dotCall64_1.1-0 gtable_0.3.4
## [40] glue_1.6.2 RANN_2.6.1 reshape2_1.4.4
## [43] dplyr_1.1.3 Rcpp_1.0.11 scattermore_1.2
## [46] jquerylib_0.1.4 vctrs_0.6.4 nlme_3.1-163
## [49] spatstat.explore_3.2-5 progressr_0.14.0 lmtest_0.9-40
## [52] spatstat.random_3.2-1 xfun_0.40 stringr_1.5.0
## [55] globals_0.16.2 mime_0.12 miniUI_0.1.1.1
## [58] lifecycle_1.0.3 irlba_2.3.5.1 goftest_1.2-3
## [61] future_1.33.0 MASS_7.3-60 zoo_1.8-12
## [64] scales_1.2.1 promises_1.2.1 spatstat.utils_3.0-4
## [67] parallel_4.2.2 RColorBrewer_1.1-3 yaml_2.3.7
## [70] reticulate_1.34.0 pbapply_1.7-2 gridExtra_2.3
## [73] ggplot2_3.4.4 sass_0.4.7 stringi_1.7.12
## [76] highr_0.10 rlang_1.1.1 pkgconfig_2.0.3
## [79] matrixStats_1.0.0 evaluate_0.23 lattice_0.22-5
## [82] tensor_1.5 ROCR_1.0-11 purrr_1.0.2
## [85] labeling_0.4.3 patchwork_1.1.3 htmlwidgets_1.6.2
## [88] cowplot_1.1.1 tidyselect_1.2.0 parallelly_1.36.0
## [91] RcppAnnoy_0.0.21 plyr_1.8.9 magrittr_2.0.3
## [94] R6_2.5.1 generics_0.1.3 withr_2.5.2
## [97] pillar_1.9.0 fitdistrplus_1.1-11 survival_3.5-7
## [100] abind_1.4-5 sp_2.1-1 tibble_3.2.1
## [103] future.apply_1.11.0 KernSmooth_2.23-22 utf8_1.2.4
## [106] spatstat.geom_3.2-7 plotly_4.10.3 rmarkdown_2.25
## [109] grid_4.2.2 data.table_1.14.8 digest_0.6.33
## [112] xtable_1.8-4 tidyr_1.3.0 httpuv_1.6.12
## [115] munsell_0.5.0 viridisLite_0.4.2 bslib_0.5.1
Update your browser to view this website correctly.