
MapMyCells: Cell type references, algorithms, and output files

Cell type references

10x Whole mouse brain taxonomy (CCN20230722)

Name: A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain

Species: Mouse

Description: This is a comprehensive and high-resolution transcriptomic and spatial cell type atlas for the whole adult mouse brain. The cell type atlas was created based on the combination of two single-cell-level, whole-brain-scale datasets: a single-cell RNA-sequencing (scRNA-seq) dataset of ~7 million cells profiled (~4 million cells after QC), and a spatially resolved transcriptomic dataset of ~4 million cells using MERFISH. The atlas is hierarchically organized into four nested levels of classification: 34 classes, 338 subclasses, 1,201 supertypes, and 5,322 clusters. We systematically analyzed the neuronal, non-neuronal, and immature neuronal cell types across the brain and identified a high degree of correspondence between transcriptomic identity and spatial specificity for each cell type.

Additional resources: Download the cell type annotation table. Explore the taxonomy via these notebooks.

Citation: Yao Z., et al. A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain. bioRxiv. bioRxiv; 2023. Available from: https://doi.org/10.1101/2023.03.06.531121

 

10x Human MTG SEA-AD taxonomy (CCN20230505)

Name: A high-resolution transcriptomic atlas of cell types from middle temporal gyrus from the SEA-AD aged human cohort that spans the spectrum of Alzheimer’s disease. 

Species: Human

Description: This is a high-resolution transcriptomic cell type atlas from middle temporal gyrus from the SEA-AD aged human cohort that spans the spectrum of Alzheimer’s disease. The atlas was created from a single-nucleus RNA-sequencing (snRNA-seq) dataset of ~2.0 million cells profiled (~1.4 million cells after QC). The atlas is hierarchically organized into three nested levels of classification: 3 classes, 24 subclasses, and 139 supertypes.

Additional resources: 

- Download the raw data at Sage Bionetworks’ AD Knowledge Portal 

- Download the processed/annotated data from AWS Open Data 

- Explore the data at SEA-AD.org 

Citation: Gabitto M. and Travaglini K., et al. Integrated Multimodal Cell Atlas of Alzheimer’s Disease. bioRxiv. bioRxiv; 2023. Available from: https://doi.org/10.1101/2023.05.08.539485

Mapping algorithms

Correlation mapping

Long name: One-step nearest cluster centroid mapping based on correlation

GitHub repository

Short description: Given reference clusters, select the cluster with the minimal distance to the query data, using pre-calculated marker genes and correlation as the distance metric.

Long description: For a given taxonomy, defined by a count matrix and assigned clusters, each cluster of the reference data is summarized by its cluster mean and marker genes. Each query cell is then assigned to the cluster with the minimal distance to it, computed over the pre-calculated marker genes with correlation as the distance metric.
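
The one-step assignment described above can be sketched in a few lines of NumPy. This is only an illustration, not the code used by MapMyCells: the function name correlation_map, the toy matrices, and the choice of Pearson correlation are assumptions made for the example.

```python
import numpy as np

def correlation_map(query, centroids, marker_idx):
    """Assign each query cell to the most correlated reference cluster.

    query:      (n_cells, n_genes) expression matrix (hypothetical input)
    centroids:  (n_clusters, n_genes) mean expression per reference cluster
    marker_idx: indices of the pre-calculated marker genes
    """
    q = query[:, marker_idx]
    c = centroids[:, marker_idx]
    # Center and scale each row so that a dot product is a Pearson correlation.
    # Rows that are constant across the marker genes would divide by zero here.
    q = q - q.mean(axis=1, keepdims=True)
    c = c - c.mean(axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    corr = q @ c.T  # shape (n_cells, n_clusters)
    # The highest correlation is the minimal correlation distance (1 - r).
    return corr.argmax(axis=1), corr.max(axis=1)

# Toy data: two reference clusters and two query cells over four marker genes.
centroids = np.array([[5.0, 0.0, 1.0, 0.0],
                      [0.0, 4.0, 0.0, 2.0]])
query = np.array([[4.0, 1.0, 2.0, 0.0],
                  [0.0, 3.0, 1.0, 2.0]])
labels, scores = correlation_map(query, centroids, np.array([0, 1, 2, 3]))
```

In this toy case the first query cell lands in cluster 0 and the second in cluster 1, because each is positively correlated with one centroid and negatively correlated with the other.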

Benchmarking: Learn more about the algorithm's performance on curated benchmark datasets.

Recommended usage: This is a robust option when your data was generated by the same sequencing platform as the reference data you’re mapping against. It’s also highly robust for most single-cell RNA-Seq (10X v3, 10X v2, SMART-Seq) and MERFISH datasets. It is faster and computationally simpler than the other methods.

Hierarchical correlation mapping

Long name: Hierarchical nearest cluster centroid mapping based on correlation

GitHub repository

Short description: Given the hierarchy tree of reference clusters, traverse the tree down to a terminal cluster, selecting at each node the branch with the minimal distance to the query data, using pre-calculated marker genes and correlation as the distance metric.

Long description: The reference data consists of a set of cells, their cluster assignments, and a tree-like taxonomy of cell types (e.g. class, subclass, supertype, cluster). Each cell type cluster is characterized by the mean gene expression profile of the cells in that cluster. Each parent node in the taxonomy has a pre-selected set of marker genes used for distinguishing between its children. To assign an unlabeled cell to a cell type, we traverse the tree, starting at the root node. We use the root node marker genes to select a cell type class by comparing the unlabeled cell’s gene expression profile to the mean gene expression profile of each class in those marker genes, selecting the class with the highest correlation to the unlabeled cell. For robustness, we repeat this process 100 times, using a random 90% of the marker genes each time, and assign the unlabeled cell to the class that receives the plurality of votes. We repeat this process, selecting only among the chosen class’s children, using the chosen class’s marker genes, and again iterating 100 times to choose a subclass. We continue down the taxonomy tree in this fashion until we have chosen a leaf node (e.g. a cluster).
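
The traversal and voting scheme above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cell_type_mapper implementation: the toy tree layout, the function names, the node names, and the returned vote fraction are all hypothetical.

```python
import numpy as np

def pearson_to_each(cell, profiles):
    """Pearson correlation between one cell vector and each row of profiles."""
    a = cell - cell.mean()
    b = profiles - profiles.mean(axis=1, keepdims=True)
    return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))

def vote_for_child(cell, child_means, marker_idx, rng, n_iter=100, frac=0.9):
    """Pick a child node by plurality vote over random marker-gene subsets."""
    n_keep = max(2, int(round(frac * len(marker_idx))))
    votes = np.zeros(child_means.shape[0], dtype=int)
    for _ in range(n_iter):
        # Each iteration uses a random 90% of this parent's marker genes.
        subset = rng.choice(marker_idx, size=n_keep, replace=False)
        corr = pearson_to_each(cell[subset], child_means[:, subset])
        votes[corr.argmax()] += 1
    return int(votes.argmax()), votes / n_iter

def map_cell(cell, tree, rng):
    """Walk a toy taxonomy from the root down to a leaf.

    tree maps a parent name to (child_names, child_means, marker_idx);
    any name that is not a key of tree is treated as a leaf.
    """
    node, path = "root", []
    while node in tree:
        names, means, markers = tree[node]
        best, fractions = vote_for_child(cell, means, markers, rng)
        node = names[best]
        path.append((node, float(fractions[best])))
    return path

# Hypothetical two-level taxonomy over 10 genes.
tree = {
    "root": (["Neuron", "Glia"],
             np.array([[5, 5, 5, 5, 5, 0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0, 5, 5, 5, 5, 5]], dtype=float),
             np.arange(10)),
    "Neuron": (["Exc", "Inh"],
               np.array([[6, 6, 1, 0, 3, 0, 0, 0, 0, 0],
                         [0, 1, 6, 6, 3, 0, 0, 0, 0, 0]], dtype=float),
               np.arange(5)),
}
cell = np.array([6, 5, 1, 0, 3, 0, 1, 0, 0, 0], dtype=float)
path = map_cell(cell, tree, np.random.default_rng(0))
```

Here path records the node chosen at each level together with the fraction of the 100 iterations that voted for it. Reporting that fraction is an addition made for this sketch; the description above only specifies the plurality rule.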

Benchmarking: Learn more about the algorithm's performance on curated benchmark datasets.

Marker genes: The mapping result's JSON output file contains a list of marker genes. Learn more here

Recommended usage: Preferred by Allen Institute scientists, this is a robust option when your data was generated by a different sequencing platform than the reference data you’re mapping against. It is also highly robust for single-nucleus RNA-seq datasets, though slower than some other methods. This mapping algorithm is recommended for cross-species and cross-platform mapping.

Deep generative mapping

Long name: Deep hierarchical algorithm for cell type annotation based on conditional variational autoencoders

GitHub repository: Coming soon.

Citation: Gabitto MI, Travaglini KJ, et al. Integrated multimodal cell atlas of Alzheimer's disease. Res Sq [Preprint]. 2023 May 23:rs.3.rs-2921860. doi: 10.21203/rs.3.rs-2921860/v1. PMID: 37292694; PMCID: PMC10246227.

Short description: Deep Generative Mapping is a deep generative model algorithm for mapping snRNA-seq data sets and putting those data into the same latent space as data from Allen Institute-hosted reference taxonomies. 

Long description: Deep Generative Mapping is a deep generative model algorithm for mapping snRNA-seq data sets and assigning cell type identity, based on Allen Institute-hosted reference taxonomies. Cells mapped with this method are not only assigned an associated cell type, but also associated confidence on mapping, and additional information for visualizing and quantifying the level of agreement. Future versions of this application will allow mapping of snATAC-seq data, MERFISH data, and other modalities. 

Benchmarking: Algorithm performance on 10x Human MTG SEA-AD (CCN20230505) data, with 80% of the data used for training and 20% for evaluation:

- Supertype level (for testing set): mean F1=0.86, median F1=0.90, accuracy=0.91.

- Subclass level: mean F1=0.99, median F1=0.996, accuracy=0.995.
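
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It assumes that F1 is computed per label and then averaged (mean) or taken at the midpoint (median) across labels, which the benchmark above does not state explicitly. The labels and predictions are made up for illustration.

```python
import numpy as np

def per_label_f1(true, pred):
    """F1 score for each label that appears in the true annotations."""
    scores = {}
    for label in np.unique(true):
        tp = np.sum((pred == label) & (true == label))
        fp = np.sum((pred == label) & (true != label))
        fn = np.sum((pred != label) & (true == label))
        # F1 = 2*TP / (2*TP + FP + FN)
        scores[str(label)] = float(2 * tp / (2 * tp + fp + fn))
    return scores

# Made-up held-out labels and predictions.
true = np.array(["A", "A", "A", "B", "B", "C"])
pred = np.array(["A", "A", "B", "B", "B", "C"])

f1 = per_label_f1(true, pred)
values = np.array(list(f1.values()))
mean_f1 = float(values.mean())
median_f1 = float(np.median(values))
accuracy = float(np.mean(true == pred))
```

Mean and median F1 can differ noticeably when a few rare labels are mapped poorly, which is one plausible reading of the gap between the two at the supertype level.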

Confidence score explanation: The deep generative model is based on a variational autoencoder neural network, which first embeds the original input data into latent space and then trains a multilayer perceptron (MLP) model to further classify the labels. Each label is represented by a node as an output. In the last layer of the MLP classifier, there is a SoftMax layer that normalizes across all labels into probabilities. In this context, we use the SoftMax value of the predicted label as the confidence score provided in the CSV file. Although it is typically overconfident, it can still indicate the degree to which the input is associated with a certain supertype.
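
As a small worked example of the last step, the sketch below applies a softmax to made-up classifier outputs and reads off the predicted label and its confidence score. The logits are invented, and the encoder and MLP that would produce them are not modeled here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up outputs of the final layer for two cells and three labels.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.0, 0.2]])
probs = softmax(logits)

predicted = probs.argmax(axis=1)   # index of the predicted label per cell
confidence = probs.max(axis=1)     # softmax value of that label
```

The first cell has one clearly dominant logit and so gets a high confidence score, while the second has nearly equal logits and a score close to one third, even though both receive a single predicted label.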

Recommended usage: Run this algorithm on a novel snRNA-seq data set generated from human tissue to assign cellular identity, with an associated mapping confidence, based on Allen Institute-hosted reference taxonomies. These taxonomies are defined in the Human and Mammalian Brain Atlas (HMBA) and Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD) consortiums.

Algorithm output files

MapMyCells produces two main output files: a “standard” CSV output file and an “extended” JSON output file. These are archived, together with the log and summary files listed below, into a single .zip file for download. Most modern operating systems natively support unpacking zip files, usually via a right-click and "Extract all" command.

 

  • validation_log.txt: Log of messages produced by the job. Returned even for failed jobs. Useful for debugging.

  • my_job.csv: Returned by all algorithms. CSV table of mapping results.

  • my_job.json: Returned only by Hierarchical and Flat mapping. More detailed results and metadata stored in a JSON file.

  • my_job_summary_metadata.json: JSON file recording the number of cells mapped to cell types and the number of genes mapped to Ensembl IDs.

 

To extract the individual files from the command line, run

tar -xf path/to/downloaded/file.zip

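at which point the constituent CSV and JSON files should appear in your current working directory. Note that this relies on a tar that can read zip archives, such as the bsdtar shipped with macOS and recent versions of Windows. GNU tar, the default on most Linux distributions, cannot read zip archives, so use unzip there instead.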

 

Alternatively, run

tar -xvf path/to/downloaded/file.zip --directory my_directory

to unpack the files to an existing directory of your choice, e.g. my_directory.
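
If you would rather unpack the archive from Python, which behaves the same way on every platform, the standard-library zipfile module can do it. The following is a minimal sketch. The function name extract_results and the paths in the usage comment are placeholders, not part of MapMyCells.

```python
import zipfile
from pathlib import Path

def extract_results(zip_path, out_dir):
    """Unpack a downloaded results archive and return the names of its members."""
    out_dir = Path(out_dir)
    # Unlike the tar command above, this creates the target directory if needed.
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()

# Example usage (placeholder paths):
# extract_results("path/to/downloaded/file.zip", "my_directory")
```

The returned list of member names is a quick way to check which of the files listed above were actually produced by a given job.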

 

The contents of these files are documented in detail here: https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md 

 

At a high level, apart from a few lines of metadata prefixed with a ‘#’, the CSV file is meant to be read into a dataframe or an Excel spreadsheet. For example, in Python:

import pandas
data_frame = pandas.read_csv('path/to/output.csv', comment='#')

The JSON file is the serialized representation of a dict with more detailed results, for those comfortable with deserializing JSON blobs. The JSON file is also where the metadata associated with the mapping run is saved.
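
To take a first look at both files without assuming anything about their specific columns or keys, something like the following sketch may help. The two helper names are hypothetical, and the first one assumes the ‘#’-prefixed metadata lines sit at the top of the CSV file. The authoritative description of the contents is the output documentation linked above.

```python
import json

def read_metadata_lines(csv_path):
    """Return the leading '#'-prefixed lines of the CSV, with the prefix stripped.

    Assumes the metadata lines come before the table itself.
    """
    lines = []
    with open(csv_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("#"):
                break
            lines.append(line[1:].strip())
    return lines

def summarize_json(json_path):
    """Map each top-level key of the JSON dict to the type of its value."""
    with open(json_path, encoding="utf-8") as handle:
        blob = json.load(handle)
    return {key: type(value).__name__ for key, value in blob.items()}

# Example usage (placeholder paths):
# print(read_metadata_lines("path/to/output.csv"))
# print(summarize_json("path/to/output.json"))
```

Listing the top-level keys first is a cheap way to orient yourself before loading large nested entries into memory-hungry structures such as dataframes.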