MapMyCells Use Case: Single Nucleus RNA-seq from Human MTG

Overview

Click here to download the full R notebook for your own use

MapMyCells enables mapping of single cell and spatial trancriptomics data sets to a human middle temporal gyrus (MTG) taxonomy. The taxonomy is derived and presented in “Integrated multimodal cell atlas of Alzheimer’s disease” (https://doi.org/10.1101/2023.05.08.539485), and we encourage you to cite this work if you use MapMyCells to transfer these labels to your data. It is also used throughout the web tools on SEA-AD.org. This R workbook illustrates a common use case for the MapMyCells facility (https://knowledge.brain-map.org/mapmycells/process/) and follow up analyses. The query data included in this document are from the paper “Conserved cell types with divergent features in human versus mouse cortex ” (https://doi.org/10.1038/s41586-019-1506-7) from the Allen Institute, which also has a nice data exploration tool (RNA-Seq Data Navigator; https://celltypes.brain-map.org/rnaseq/human/mtg).

Prepare your workspace

Install R and (optionally RStudio)

This example is run and R, which can be downloaded at CRAN (https://cran.r-project.org/). To run in R, just sequentially copy and paste the relevant code blocks into R. An easier alternative is to download RStudio here (https://posit.co/download/rstudio-desktop/) after downloading R. “.Rmd” files can be directly loaded into R studio and run.

Download the query data

As an example data set, we will use the a subset of the snRNA-seq dataset published in Hodge, et al. (2019), which has been conveniently packaged into an R library. (To see how one would perform this step from a 10X experiment, see the version of this tutorial for whole mouse brain.)

if(!is.element("hodge2019data",.packages(all.available = TRUE)))  # Install data package if not already installed
  devtools::install_github("AllenInstitute/hodge2019data")

Set the working directory

The next component of set up is to make sure your R working directory points to the data you are reading in.

# Uncomment line below and replace bracketed text with path to downloaded files.
#setwd("FILE_PATH")

Load the relevant libraries

This workbook uses the libraries anndata, Seurat, and hodge2019data.

# Install Seurat and anndata if needed
list.of.packages <- c("Seurat", "anndata")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)>0) install.packages(new.packages)

suppressPackageStartupMessages({
  library(Seurat)        # For reading droplet data sets and visualizing/comparing results
  library(anndata)       # For writing h5ad files
  library(hodge2019data) # For the query data set
})

## Warning: package 'Seurat' was built under R version 4.2.3

## Warning: package 'anndata' was built under R version 4.2.3

options(stringsAsFactors=FALSE)

# Citing R libraries
# citation("Seurat") # Note that the citation for any R library can be pulled up using the citation command. We encourage citation of R libraries as appropriate.

With the above libraries loaded, the use case below can now be run.

Analysis of snRNA-seq data

(If you already have an .h5ad file ready for upload, skip to step 4.)

A common use case for analysis of single cell/nucleus transcriptomics is to collect data from several thousand cells, either though one (or more) ports of a droplet-based scRNA-seq run or using single well per cell SMART-Seq sequencing. After some QC steps, these are then clustered for defining cell types. This section describes how to start from such a data set, transfer labels from human “Human-MTG-10x_SEA-AD” data from the Allen Institute / SEA-AD onto these cells using MapMyCells, and then visualize the results in a UMAP.

1. Read query data

With the data library loaded, these are already read in. We just rename for convenience.

dataIn <- data_Hodge2019        # These are the counts
annoIn <- metadata_Hodge2019    # These are the metadata (e.g., cluster assignments)
dim(dataIn)

## [1] 36591  3357

This reads in a data matrix with 3,357 nuclei.

2. QC data

The above data has already been carefully QC’ed, and therefore a QC step is unnecessary. However, in a typical experiment a QC step would be needed at this point. For demonstrative purposes, we will define all barcodes with >250 reads as interesting “cells”, but in a real experiment more careful QC is strongly encouraged.

dataQC <- dataIn[,colSums(dataIn)>250]
dim(dataQC)

## [1] 36591  3357

This step retains all cells for this example, but in a droplet-based experiment it would typically bring the input library down from a huge number to a more reasonable number.

3. Output to h5ad format

Now let’s output the QC’ed data matrix into an h5ad file and output it to the current directory for upload to MapMyCells. Note that in anndata data structure the genes are saved as columns rather than genes so we need to transpose the matrix first.

# Transpose data
dataQCt = Matrix::t(dataQC)

# Convert to anndata format
ad <- AnnData(
  X = dataQCt,
  obs = data.frame(group = rownames(dataQCt), row.names = rownames(dataQCt)),
  var = data.frame(type = colnames(dataQCt), row.names = colnames(dataQCt))
)

# Write to compressed h5ad file
write_h5ad(ad,'Hodge2019.h5ad',compression='gzip')

# Check file size. File MUST be <500MB to upload for MapMyCells
print(paste("Size in MB:",round(file.size("Hodge2019.h5ad")/2^20)))

## [1] "Size in MB: 73"

If you have trouble accessing or running the previous block, check your Python accessibility in R and consider installing python as shown below. If you use Windows, you may be asked to install git before installing python, which can be accessed here: https://git-scm.com/download/win/.

library(reticulate)
version <- "3.9.12"
install_python(version)
virtualenv_create("my-environment", version = version)
use_virtualenv("my-environment")

4. Assign cell types using MapMyCells

These next steps are performed OUTSIDE of R in the MapMyCells web application.

The steps to MapMyCells are as follows:

Go to (https://knowledge.brain-map.org/mapmycells/process/).
(Optional) Log in to MapMyCells.
Upload ‘Hodge2019.h5ad’ to the site via the file system or drag and drop (Step 1).
Choose “10x Human MTG SEA-AD (CCN20230505)” as the “Reference Taxonomy” (Step 2).
Choose the desired “Mapping Algorithm” (in this case “Deep Generative Mapping”).
Click “Start” and wait ~5 minutes. (Optional) You may have a panel on the left that says “Map Results” where you can also wait for your run to finish.
When the mapping is complete, you will have an option to download the “zip” file with mapping results. If your browser is preventing popups, search for small folder icon to the right URL address bar to enable downloads.
This download file (and the files therein) will have a long name. As of this writing the name is of the format (Upload file name)(Taxonomy name(id))(Mapping algorithm)(Unique run ID). Unzip this file, which will contain three or four files: “validation_log.txt”, “[long_name].json”, and “[long_name].csv”, and maybe [long_name]_summary_metadata.json. [long_name].csv contains the mapping results, which you need. The validation log will give you information about the the run itself and the json files will give you extra information about the mapping. You can ignore the latter files if the run completes successfully and if you are not a power user.
Copy [long_name].csv to your current working directory and rename it “Hodge2019_mapping.csv”).
You can now go back to R and continue the script below.

5. Read mapping results into R

Let’s now look at the output results from the hierarchical clustering algorithm. We can read this into R using read.csv, but note that the first few lines contain metadata that need to be skipped, and begin with a ‘#’ character.

mapping <- read.csv("Hodge2019_mapping.csv",comment.char="#")
head(data.frame(mapping))

	cell_id	class_label	class_name	class_bootstrapping_probability
1	LS-15093_S11_E1-50	CS20230505_CLAS_0001	Neuronal: GABAergic	1
2	LS-15093_S13_E1-50	CS20230505_CLAS_0002	Neuronal: Glutamatergic	1
3	LS-15093_S21_E1-50	CS20230505_CLAS_0002	Neuronal: Glutamatergic	1
4	LS-15093_S26_E1-50	CS20230505_CLAS_0002	Neuronal: Glutamatergic	1
5	LS-15093_S37_E1-50	CS20230505_CLAS_0002	Neuronal: Glutamatergic	1
6	LS-15093_S39_E1-50	CS20230505_CLAS_0002	Neuronal: Glutamatergic	1

6 rows | 1-5 of 11 columns [This notebook preview is limited to the first 5 columns. Run the notebook in your own environment to see all columns.]

In this taxonomy, MapMyCells maps input cells to the taxonomy at three increasing levels of resolution from coarsest class, to intermediate subclass, and finest supertype. In this human SEA-AD taxonomy there are 3 classes, 24 subclasses, and 139 supertypes.

The file consists of the following columns:

cell_id = the cell identifiers for your cells included in the initial h5ad files
class_label = the unique identifier for the best mapping class
class_name = the name of the best mapping class
class_softmax_probability = a score indicating the confidence that the cell mapped to the above class is accurately mapped (higher numbers are better, with 1 being the best; note that other mapping algorithms have different types of confidence scores with different names for this column)
and similar fields for subclass, supertype, and cluster. For finest level cluster there is an additional field

6. Compare subclass and cluster assignments

As this data set is previously published, we already know the expected subclasses for each cell. Let’s check how well the mapped and clustered subclass calls match. Recall that we defined metadata as “annoIn” above. (This step will only work if your query data has previously assigned cell types, and may need slight modifications.)

comparison = table(mapping$subclass_name,annoIn$subclass)
heatmap(comparison, ylab = "Mapping", xlab = "Clustering",margins = c(10,10))

There is extremely good agreement, with most of the boxes showing all (red) or none (yellow) of the cells mapped to the same SEA-AD subclass coming from a given Hodge subclass call. (It turns out that the disagreements are largely due to differences in how subclasses are named or how their borders are defined between 2019 and 2023.)

Using the same strategy, we can see how clusters from Hodge et al 2019 correspond to supertypes from SEA-AD.

comparison = table(mapping$supertype_name,annoIn$cluster_label)
heatmap(comparison, ylab = "Mapping", xlab = "Clustering",margins = c(8,8), cexRow = 0.5, cexCol = 0.5)

To do this properly would require more complicated analysis, but the main take-home here is that there is strong correspondence between initial cluster calls and mapping supertype assignments (lots of red and yellow, minimal intermediate colors).

7. Visualize mapping results

To visualize the mapping results, we need both the mapping results and the original query cellxgene matrix for comparison. If this is not already read in, you can read it in from the anndata object uploaded to MapMyCells.

# Since the query data corresponds to dataQC above, we will call it dataQC again
dataQC_h5ad <- read_h5ad('Hodge2019.h5ad')
dataQC <- t(as.matrix(dataQC_h5ad$X))
rownames(dataQC) <- rownames(dataQC_h5ad$var)
colnames(dataQC) <- rownames(dataQC_h5ad$obs)

Now let’s visualize the mapping results. We will do this by saving the data in a Seurat object with mapping results as metadata, running the standard pipeline for creating a UMAP in Seurat, and then color-coding each cell.

# Create the Seurat object
dataSeurat <- CreateSeuratObject(counts = dataQC, meta.data = mapping)

# Standard Seurat pipeline
dataSeurat <- NormalizeData(dataSeurat, verbose = FALSE)
dataSeurat <- FindVariableFeatures(dataSeurat, verbose = FALSE)
dataSeurat <- ScaleData(dataSeurat, verbose = FALSE)
dataSeurat <- RunPCA(dataSeurat, verbose = FALSE)
dataSeurat <- RunUMAP(dataSeurat, dims = 1:10, verbose = FALSE)

Now let’s make the plot for subclasses!

DimPlot(dataSeurat, reduction = "umap", group.by="subclass_name", label=TRUE) + NoLegend()

Here the data has not been clustered but rather the class assignments are assigned to the data in the UMAP. You can see that cells assigned different mapped types localize in different areas of the plot, suggesting that the label transfer works well. It’s worth noting that Seurat produces different UMAP configurations in different R environments, so your plot may not look exactly like this.

Now let’s repeat this analysis where we color-code UMAPs by supertype after separating cells mapped to different classes.

classes <- unique(mapping$class_name)
p       <- list()
for (cl in classes){
  datClass <- dataSeurat[,mapping$class_name==cl]
  datClass <- NormalizeData(datClass, verbose = FALSE)
  datClass <- FindVariableFeatures(datClass, verbose = FALSE)
  datClass <- ScaleData(datClass, verbose = FALSE)
  datClass <- RunPCA(datClass, verbose = FALSE)
  datClass <- RunUMAP(datClass, dims = 1:10, verbose = FALSE)
  p[[cl]]  <- DimPlot(datClass, reduction = "umap", group.by="supertype_name", label=TRUE) + NoLegend()
}
p[["Neuronal: GABAergic"]]

p[["Neuronal: Glutamatergic"]]

p[["Non-neuronal and Non-neural"]]

Once again, the supertypes shown largely segregate from one another without the need to apply clustering, suggesting this mapping works well even at higher resolutions. Scripts such as this can be used for visualizing and comparing mapping results. It’s worth noting that given how different these subclasses are from one another, viewing the data separately for each subclass or more carefully choosing variable genes would likely result in even better visualizations.

To output the session information we write the command, which is useful for reproducibility, especially for more complex scripts.

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hodge2019data_1.0  anndata_0.7.5.6    SeuratObject_5.0.0 Seurat_4.4.0      
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.16             colorspace_2.1-0       deldir_1.0-9          
##   [4] ellipsis_0.3.2         ggridges_0.5.4         rstudioapi_0.15.0     
##   [7] spatstat.data_3.0-3    farver_2.1.1           leiden_0.4.3          
##  [10] listenv_0.9.0          ggrepel_0.9.4          fansi_1.0.5           
##  [13] codetools_0.2-19       splines_4.2.2          cachem_1.0.8          
##  [16] knitr_1.45             polyclip_1.10-6        spam_2.10-0           
##  [19] jsonlite_1.8.7         ica_1.0-3              cluster_2.1.4         
##  [22] png_0.1-8              uwot_0.1.16            shiny_1.7.5.1         
##  [25] sctransform_0.4.1      spatstat.sparse_3.0-3  compiler_4.2.2        
##  [28] httr_1.4.7             assertthat_0.2.1       Matrix_1.6-1.1        
##  [31] fastmap_1.1.1          lazyeval_0.2.2         cli_3.6.1             
##  [34] later_1.3.1            htmltools_0.5.6.1      tools_4.2.2           
##  [37] igraph_1.5.1           dotCall64_1.1-0        gtable_0.3.4          
##  [40] glue_1.6.2             RANN_2.6.1             reshape2_1.4.4        
##  [43] dplyr_1.1.3            Rcpp_1.0.11            scattermore_1.2       
##  [46] jquerylib_0.1.4        vctrs_0.6.4            nlme_3.1-163          
##  [49] spatstat.explore_3.2-5 progressr_0.14.0       lmtest_0.9-40         
##  [52] spatstat.random_3.2-1  xfun_0.40              stringr_1.5.0         
##  [55] globals_0.16.2         mime_0.12              miniUI_0.1.1.1        
##  [58] lifecycle_1.0.3        irlba_2.3.5.1          goftest_1.2-3         
##  [61] future_1.33.0          MASS_7.3-60            zoo_1.8-12            
##  [64] scales_1.2.1           promises_1.2.1         spatstat.utils_3.0-4  
##  [67] parallel_4.2.2         RColorBrewer_1.1-3     yaml_2.3.7            
##  [70] reticulate_1.34.0      pbapply_1.7-2          gridExtra_2.3         
##  [73] ggplot2_3.4.4          sass_0.4.7             stringi_1.7.12        
##  [76] highr_0.10             rlang_1.1.1            pkgconfig_2.0.3       
##  [79] matrixStats_1.0.0      evaluate_0.23          lattice_0.22-5        
##  [82] tensor_1.5             ROCR_1.0-11            purrr_1.0.2           
##  [85] labeling_0.4.3         patchwork_1.1.3        htmlwidgets_1.6.2     
##  [88] cowplot_1.1.1          tidyselect_1.2.0       parallelly_1.36.0     
##  [91] RcppAnnoy_0.0.21       plyr_1.8.9             magrittr_2.0.3        
##  [94] R6_2.5.1               generics_0.1.3         withr_2.5.2           
##  [97] pillar_1.9.0           fitdistrplus_1.1-11    survival_3.5-7        
## [100] abind_1.4-5            sp_2.1-1               tibble_3.2.1          
## [103] future.apply_1.11.0    KernSmooth_2.23-22     utf8_1.2.4            
## [106] spatstat.geom_3.2-7    plotly_4.10.3          rmarkdown_2.25        
## [109] grid_4.2.2             data.table_1.14.8      digest_0.6.33         
## [112] xtable_1.8-4           tidyr_1.3.0            httpuv_1.6.12         
## [115] munsell_0.5.0          viridisLite_0.4.2      bslib_0.5.1

MapMyCells Use case: single nucleus RNA-seq from human MTG