
MapMyCells: Mapping algorithms

Correlation mapping

Long name: One-step nearest cluster centroid mapping based on correlation

GitHub repository

Short description: Given a set of reference clusters, select the cluster at the smallest distance from the query data, using pre-calculated marker genes and correlation as the distance metric.

Long description: For a given taxonomy, defined by a count matrix and its cluster assignments, each cluster of the reference data is summarized by its cluster mean and marker genes. Query data are assigned to the cluster at the smallest distance from them, using the pre-calculated marker genes and correlation as the distance metric.
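The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the MapMyCells implementation: the function names, the toy centroids, and the choice of Pearson correlation (with "minimal distance" read as "highest correlation") are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)

def map_to_nearest_centroid(query, centroids, marker_idx):
    """Return (cluster, correlation) for the best-correlated cluster mean.

    Only the marker-gene positions in marker_idx are compared.
    """
    q = [query[i] for i in marker_idx]
    best_label, best_r = None, -2.0  # any real correlation is >= -1
    for label, mean_profile in centroids.items():
        r = pearson(q, [mean_profile[i] for i in marker_idx])
        if r > best_r:
            best_label, best_r = label, r
    return best_label, best_r

# Hypothetical reference: mean expression per cluster over five genes.
centroids = {
    "cluster_A": [5.0, 0.1, 3.0, 0.2, 1.0],
    "cluster_B": [0.2, 4.0, 0.1, 3.5, 1.0],
}
marker_idx = [0, 1, 2, 3]            # hypothetical pre-calculated marker genes
query = [4.0, 0.0, 2.5, 0.5, 9.0]    # gene 4 is not a marker and is ignored

label, r = map_to_nearest_centroid(query, centroids, marker_idx)
print(label, round(r, 3))
```

Because the comparison is a single pass over the cluster means, the cost grows linearly with the number of clusters, which is consistent with this being the simpler and faster of the methods.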

Benchmarking: Learn more about the algorithm's performance on curated benchmark datasets.

Recommended usage: This is a robust option when your data was generated by the same sequencing platform as the reference data you’re mapping against. It’s also highly robust for most single-cell RNA-Seq (10X v3, 10X v2, SMART-Seq) and MERFISH datasets. Its computation is faster and simpler than that of the other methods.

 


 

Hierarchical correlation mapping

Long name: Hierarchical nearest cluster centroid mapping based on correlation

GitHub repository

Short description: Given the hierarchy tree of reference clusters, traverse the tree down to a terminal cluster, at each level selecting the branch at the smallest distance from the query data, using pre-calculated marker genes and correlation as the distance metric.

Long description: The reference data consists of a set of cells, their cluster assignments, and a tree-like taxonomy of cell types (e.g. class, subclass, supertype, cluster). Each cell type cluster is characterized by the mean gene expression profile of the cells in that cluster. Each parent node in the taxonomy has a pre-selected set of marker genes used for distinguishing between its children. To assign an unlabeled cell to a cell type, we traverse the tree, starting at the root node. We use the root node marker genes to select a cell type class by comparing the unlabeled cell’s gene expression profile to the mean gene expression profile of each class in those marker genes, selecting the class with the highest correlation to the unlabeled cell. For robustness, we repeat this process 100 times, using a random 90% of the marker genes each time, and assign the unlabeled cell to the class that receives the plurality of votes. We repeat this process, selecting only among the chosen class’s children, using the chosen class’s marker genes, and again iterating 100 times to choose a subclass. We continue down the taxonomy tree in this fashion until we have chosen a leaf node (e.g. a cluster).
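The traversal with repeated marker-gene subsampling can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MapMyCells code: the data layout (`tree`, `means`, `markers`), the function names, the toy values, and the tie-breaking rule are all hypothetical. The 100 iterations and the 90% marker fraction come from the description above.

```python
import math
import random
from collections import Counter

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def choose_child(query, child_means, marker_idx, rng, n_iter=100, frac=0.9):
    """Vote among children using random subsets of the parent's marker genes."""
    k = min(len(marker_idx), max(2, round(frac * len(marker_idx))))
    votes = Counter()
    for _ in range(n_iter):
        subset = rng.sample(marker_idx, k)
        q = [query[i] for i in subset]
        best = max(
            child_means,
            key=lambda c: pearson(q, [child_means[c][i] for i in subset]),
        )
        votes[best] += 1
    # Plurality vote; ties go to the child that received a vote first (assumption).
    winner, count = votes.most_common(1)[0]
    return winner, count / n_iter

def map_hierarchical(query, tree, means, markers, root="root", seed=0):
    """Walk from the root to a leaf; return [(node, vote fraction), ...]."""
    rng = random.Random(seed)
    node, path = root, []
    while tree.get(node):  # leaves have no entry in tree
        child_means = {c: means[c] for c in tree[node]}
        node, support = choose_child(query, child_means, markers[node], rng)
        path.append((node, support))
    return path

# Hypothetical two-level taxonomy over ten genes.
tree = {"root": ["class_1", "class_2"], "class_1": ["cluster_1a", "cluster_1b"]}
means = {
    "class_1": [5, 5, 5, 5, 5, 0, 0, 0, 0, 0],
    "class_2": [0, 0, 0, 0, 0, 5, 5, 5, 5, 5],
    "cluster_1a": [9, 8, 1, 0, 5, 0, 0, 0, 0, 0],
    "cluster_1b": [1, 0, 9, 8, 5, 0, 0, 0, 0, 0],
}
markers = {"root": list(range(10)), "class_1": [0, 1, 2, 3, 4]}
query = [8, 9, 0, 1, 5, 0, 1, 0, 0, 0]

path = map_hierarchical(query, tree, means, markers)
print(path)
```

The vote fraction returned at each level is one simple way to see how stable a choice is under marker subsampling; whether the application reports the same quantity is not stated here.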

Benchmarking: Learn more about the algorithm's performance on curated benchmark datasets.

Marker genes: The mapping result's JSON output file contains a list of marker genes. Learn more here

Recommended usage: Preferred by Allen Institute scientists, this is a robust option when your data was generated by a different sequencing platform than the reference data you’re mapping against. It is also highly robust for single-nucleus RNA-seq datasets and is recommended for cross-species and cross-platform mapping. It is slower than some other methods.

 


 

Deep generative mapping

Long name: Deep hierarchical algorithm for cell type annotation based on conditional variational autoencoders

GitHub repository: Coming soon.

Citation: Gabitto M.I. and Travaglini K.J., et al. Integrated Multimodal Cell Atlas of Alzheimer’s Disease. Nat Neurosci (2024). Available from: https://doi.org/10.1038/s41593-024-01774-5.

Short description: Deep Generative Mapping is a deep generative model algorithm for mapping snRNA-seq data sets and putting those data into the same latent space as data from Allen Institute-hosted reference taxonomies. 

Long description: Deep Generative Mapping is a deep generative model algorithm for mapping snRNA-seq data sets and assigning cell type identity, based on Allen Institute-hosted reference taxonomies. Cells mapped with this method are assigned not only a cell type but also a mapping confidence and additional information for visualizing and quantifying the level of agreement. Future versions of this application will allow mapping of snATAC-seq data, MERFISH data, and other modalities.

Benchmarking: Algorithm performance on 10x Human MTG SEA-AD (CCN20230505) data, with 80% of the data used for training and 20% for evaluation:

- Supertype level (for testing set): mean F1 = 0.86, median F1 = 0.90, accuracy = 0.91.

- Subclass level: mean F1 = 0.99, median F1 = 0.996, accuracy = 0.995.
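For readers unfamiliar with these summary statistics, the sketch below shows one common way to compute them: an F1 score per label, then the mean and median across labels, plus overall accuracy. It is an assumption that the benchmark figures above were computed this way; the function names and toy labels are hypothetical.

```python
from statistics import mean, median

def per_label_f1(true_labels, predicted_labels):
    """F1 score for each label that appears in the true labels."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in sorted(set(true_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def summarize(true_labels, predicted_labels):
    """Mean and median of per-label F1, plus overall accuracy."""
    f1 = list(per_label_f1(true_labels, predicted_labels).values())
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return {
        "mean_f1": mean(f1),
        "median_f1": median(f1),
        "accuracy": correct / len(true_labels),
    }

# Toy held-out set: label C is often confused with label B.
true_labels = ["A"] * 4 + ["B"] * 3 + ["C"] * 4
predicted = ["A"] * 4 + ["B"] * 3 + ["C", "B", "B", "B"]

summary = summarize(true_labels, predicted)
print(summary)
```

With per-label scores, a mean below the median (as in this toy example) indicates that a few poorly mapped labels are pulling the average down.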

Confidence score explanation: The deep generative model is based on a variational autoencoder neural network that first embeds the original input data into a latent space; a multilayer perceptron (MLP) is then trained to classify the labels. Each label is represented by one output node. The last layer of the MLP classifier is a softmax layer that normalizes the outputs across all labels into probabilities. We use the softmax value of the predicted label as the confidence score provided in the CSV file. Although this score is typically overconfident, it can still indicate the degree to which the input is associated with a certain supertype.
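To make the confidence score concrete, here is a minimal sketch of how a softmax layer turns raw classifier outputs into probabilities and how the score for the predicted label is read off. The label names and output values are made up for illustration, and the function names are hypothetical; only the softmax step itself is taken from the explanation above.

```python
import math

def softmax(logits):
    """Normalize raw output-node values into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits, labels):
    """Return the most probable label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical output-node values for one cell, one node per label.
labels = ["supertype_A", "supertype_B", "supertype_C"]
logits = [2.0, 0.5, -1.0]

label, confidence = predict_with_confidence(logits, labels)
print(label, round(confidence, 3))
```

Since softmax exaggerates differences between output values, a high score is better read as a relative indication than as a calibrated probability, in line with the overconfidence caveat above.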

Recommended usage: Run the application on a novel snRNA-seq data set generated from human tissue to assign cellular identity, with associated mapping confidence, based on Allen Institute-hosted reference taxonomies. These taxonomies are defined by the Human and Mammalian Brain Atlas (HMBA) and Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD) consortia.