Cell Type Nomenclature

Background

Devising a naming convention that is both colloquial and intuitive is a challenge for any ontology or taxonomy. Yet an accurate and flexible nomenclature is necessary to describe and analyze existing and upcoming cellular-level data. The Cell Type Ontology Workshop (held at the Allen Institute in Seattle, June 17-18, 2019) convened experts in the fields of ontology, taxonomy, and neuroscience to make recommendations, highlight best practices, and propose conventions for naming cell types. As an outcome of this workshop, a framework for developing nomenclatures, called the Common Cell Type Nomenclature (CCN), was constructed.

An overview of this framework is presented here. Details of the CCN (version 12-2020) are described in a publication in eLife by Miller, et al. (2020). Step-by-step instructions and associated code for applying the CCN to any cell typing analysis can be found on GitHub. Finally, community input on all aspects of this framework can be provided in the Allen Brain Map Community Forum. An archived version of this page containing the initial version of the CCN (version 10-2019) is available here.

Overview

Identifying and naming brain cells has been an integral part of neuroscience for a century. Many neuronal cell types, such as neurogliaform, chandelier, Martinotti, and pyramidal cells, have been identified based on highly distinct shape, location, or electrical properties, providing consistent classifications and a common vocabulary. More recently, single-cell RNA sequencing (RNA-Seq) has led to an explosion of quantitative cell type definitions across multiple organs and organisms, with widely varying nomenclature and limited alignment between taxonomies.

To facilitate cross-dataset comparison, the Allen Institute created the CCN for matching and tracking cell types across studies. A primary objective of the CCN was to develop a system that would be straightforward and allow designation of cell types with or without hierarchical organization. Although application of this nomenclature convention is initially focused on select examples in brain, it is applicable to new or established taxonomies derived from diverse quantifiable modalities, using multiple data sources, species, and organ systems, and is intended to encompass existing naming strategies used in publications across diverse research teams. This convention was motivated by methodologies used for management of gene transcript identity tracked across different versions of GENCODE genome builds, allowing comparison of matched types with a common reference or any other taxonomy. Motivated by gene nomenclature conventions from HGNC (Bruford et al., 2020), the CCN also facilitates assigning accurate yet flexible cell type names in the mammalian cortex as a step toward community-wide efforts to organize multi-source, data-driven information related to cell type taxonomies from any organism. This website describes version 07-2021 of the CCN (which has only minor updates from version 12-2020 published in eLife).

Cell Type Tracking Schema

The CCN schema consists of two primary components, both of which are assigned unique IDs, aliases, and metadata that track provenance and align information between studies. First, a taxonomy (below left) is defined as the set of quantitatively derived data clusters (or provisional cell types) produced by running a specific computational algorithm on a specific dataset(s). Stated differently, a taxonomy is a system for tracking structured tertiary analysis of data. Second, a cell set (below right) is defined as any tagged group of cells in a taxonomy, although these most commonly represent individual provisional cell types or sets of provisional cell types (e.g., all glutamatergic cell types).

The CCN schema treats all cell sets identically regardless of origin but provides specific alias tags for linking cell sets within and between taxonomies. For example, cell set labels track groups of cell types within a single taxonomy that could span many organ systems, whereas cell set aliases can provide both data-driven and common-usage cell type names for public consumption (e.g., in manuscripts). In particular, some cell sets can be assigned a cell set aligned alias, which is a biologically driven name for linking matching cell sets across taxonomies (e.g., "L5 ET") intended to be analogous to “gene symbol.” A glossary of taxonomy and cell set tags and other relevant terms is presented at the end of this page.
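As a rough illustration of the schema, the two components can be modeled as simple records. The sketch below is not the data model used by the CCN code: the class names, field names, taxonomy IDs, labels, and preferred aliases are all hypothetical, chosen only to mirror the tags described above. It shows how an aligned alias such as "L5 ET" could link cell sets between two taxonomies.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CellSet:
    """One tagged group of cells within a taxonomy (hypothetical record)."""
    accession_id: str                      # unique across all taxonomies
    label: str                             # unique within one taxonomy
    preferred_alias: Optional[str] = None  # name used in a paper; may be blank
    aligned_alias: Optional[str] = None    # at most one; links across taxonomies
    additional_aliases: List[str] = field(default_factory=list)


@dataclass
class Taxonomy:
    """Clusters produced by one algorithm run on specific dataset(s)."""
    taxonomy_id: str
    cell_sets: List[CellSet] = field(default_factory=list)


def matched_sets(a: Taxonomy, b: Taxonomy) -> List[Tuple[CellSet, CellSet]]:
    """Pair up cell sets from two taxonomies that share an aligned alias."""
    by_alias = {cs.aligned_alias: cs for cs in b.cell_sets if cs.aligned_alias}
    return [
        (cs, by_alias[cs.aligned_alias])
        for cs in a.cell_sets
        if cs.aligned_alias in by_alias
    ]


# Hypothetical example data: two taxonomies that both contain an "L5 ET" set.
mouse = Taxonomy("CCN201910120", [
    CellSet("CS201910120_1", "VISp 1", preferred_alias="Type A", aligned_alias="L5 ET"),
    CellSet("CS201910120_2", "VISp 2", preferred_alias="Type B"),
])
human = Taxonomy("CCN201910121", [
    CellSet("CS201910121_7", "MTG 7", preferred_alias="Type X", aligned_alias="L5 ET"),
])

for left, right in matched_sets(mouse, human):
    print(left.accession_id, "<->", right.accession_id, "via", left.aligned_alias)
```

Cell sets without an aligned alias (here "VISp 2") are simply left unmatched, which reflects the fact that the aligned alias is optional in the schema.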

Naming Convention

Mammalian brain cell types inhabit a complex landscape with fuzzy boundaries and complicated correspondences between species and modalities, leading to a variety of disparate solutions for naming cell types. Thus, a challenging and potentially contentious question in cell type classification is how these newly identified cell types should be named. The CCN utilizes a strategy for naming cell types in the mammalian cortex that includes properties that are cell intrinsic and potentially well conserved between species. While admittedly underdeveloped, this convention has been applied to multiple studies of the primary motor cortex and represents only a starting point for discussion.

Class	Format	Example
Glutamatergic	[Layer] [Projection] #	L2/3 IT 4
GABAergic	[Canonical gene] #	Pvalb 3
Non-neuronal	[Cell class] #	Microglia 2
(alternative)	[Historical name]	Chandelier

For glutamatergic neurons, cell types are named based on predominant layer(s) of localization of cell body (soma) and their predicted projection patterns. The relatively robust laminarity of glutamatergic cell types has been described based on cytoarchitecture in multiple mammalian species for many years and projection targets for cell types have been explicitly measured in mouse primary visual cortex using a combination of retrograde labeling and scRNA-seq (Tasic et al., 2018; Tasic et al., 2016). For GABAergic interneurons, developmental origin may define cell types by their canonical marker gene profile established early in development, with Pvalb and Sst primarily labeling cell types derived from the medial ganglionic eminence and Vip, Sncg, and Lamp5 primarily labeling cell types derived from the caudal ganglionic eminence (DeFelipe et al., 2013). Non-neuronal cell types have not been a focus of the studies cited and hence they are labeled at a broad cell type level only. Feedback on how to extend this convention to other brain structures or other organs can be provided via the Allen Brain Map Community Forum.

Alignment of Established Cell Sets Using Reference Taxonomies

There is compelling evidence for the existence of distinct cell types based on robust groupings of cells by observable and measurable cell attributes. Given the evidence of generally close correlation between molecular, physiological, and morphological cell characteristics noted above, the field is moving toward the notion of defining a reference cell type classification system based on the clustering of single cell/nucleus transcriptomic (scRNA-Seq) data as an initial framework, and then layering on additional phenotypic data as they become available, especially from multimodal assays (e.g., Patch-Seq). Although the primary focus here is nomenclature, a multi-staged analysis workflow that integrates versioned reference taxonomies and the CCN is presented below, and further details are available in the eLife publication by Miller, et al. (2020). This organizational framework has been applied to cell data generated and released by the Allen Institute for Brain Science.

Applying the CCN to existing and new datasets

Any nomenclature schema is only useful if it is adopted by many scientists studying cell types across multiple institutes and organ systems. Therefore, this schema is intended to be lightweight, easy to understand, and easy to apply to novel or published data sets. Both tools and examples showing how this nomenclature can be applied to any cell-level data are provided. Code for applying this nomenclature is available in the `allen_institute_nomenclature` GitHub repository. An example of how this schema can be applied to an existing data set is shown below.

In a study of mouse primary visual cortex using ~1700 cells, a total of 42 neuronal and 7 non-neuronal cell types were proposed. The study also analyzed cell type-specific mRNA processing and characterized genetic access to these transcriptomic types through many transgenic Cre lines. This work was published in Nature Neuroscience in 2016 (Tasic et al., 2016), and the data can be browsed with an interactive navigation application. The above figure shows the hierarchical organization of cell types and cell type aliases as initially described, and cell type labels and accession IDs are appended for each cell type after application of the CCN (right columns). Annotations are also shown for a subset of the internal nodes with a cell set alias, label, and accession ID, as described. Note that by default, node aliases are left blank (e.g., nodes C and F), but nodes representing a useful collection of cell types can be manually tagged with an alias (e.g., node "B" contains all Pvalb types and no others). Details and code for reproducing this analysis are in the `allen_institute_nomenclature` GitHub repository.
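The annotation step described above can be sketched in a few lines. This is an illustration only, not the code in the `allen_institute_nomenclature` repository: the nested-list tree representation, the "VISp" label prefix, the example cell type names, and the choice to number leaves first and internal nodes afterwards in pre-order are all assumptions.

```python
def annotate_tree(tree, taxonomy_id, prefix):
    """Assign labels and accession IDs to every node of a nested-list tree.

    Leaves are strings (the alias used in the original study); internal
    nodes are lists of children. Internal node aliases are left blank.
    """
    leaves, nodes = [], []

    def walk(node):
        """Return the (first, last) leaf numbers covered by this node."""
        if isinstance(node, str):
            leaves.append(node)
            return len(leaves), len(leaves)
        slot = len(nodes)
        nodes.append(None)  # reserve this node's pre-order position
        spans = [walk(child) for child in node]
        nodes[slot] = (spans[0][0], spans[-1][1])
        return nodes[slot]

    walk(tree)
    rows = []
    for i, alias in enumerate(leaves, start=1):
        rows.append({
            "accession_id": f"CS{taxonomy_id}_{i}",
            "label": f"{prefix} {i}",
            "alias": alias,
        })
    for j, (first, last) in enumerate(nodes, start=len(leaves) + 1):
        rows.append({
            "accession_id": f"CS{taxonomy_id}_{j}",
            "label": f"{prefix} {first}-{last}",
            "alias": "",  # blank by default; can be tagged manually later
        })
    return rows


# Hypothetical five-leaf tree; the names are placeholders, not real types.
tree = [["Pvalb A", "Pvalb B"], ["Sst A", ["Vip A", "Vip B"]]]
rows = annotate_tree(tree, "201910120", "VISp")
for row in rows:
    print(row["accession_id"], row["label"], row["alias"], sep="\t")
```

A node that groups a useful collection of types, such as the one covering both Pvalb leaves here, could then be given an alias by editing its row, in the same way node "B" is tagged manually in the example above.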

Five additional use cases are presented in the publication in eLife for application of the CCN to existing and new datasets. These include 1) alignment of a human snRNA-seq taxonomy in MTG to an M1 reference, 2) application to a morphology and electrophysiology taxonomy in mouse VISp, 3) exploration of 'Sst Chodl' class persistence across multimodal phenotypes and developmental stages, 4) alignment of cell types from reptilian and mammalian cortex, and 5) comparison of a novel taxonomy to 18 existing taxonomies.

Developing the CCN as a community standard

The complexity of cell type taxonomies and their generation now requires conventions and methodology to capture and communicate essential knowledge derived from experiments. The CCN provides a schema and workflow that allows scientists to organize their cell types within a single dataset and to link taxonomies using the aligned alias and other alias terms. However, the CCN is currently a stand-alone nomenclature schema that lacks the centralization and governance of gene-based standards proposed by the HUGO Gene Nomenclature Committee (HGNC) (Bruford et al., 2020) and does not yet have a mechanism for integrating with underlying data and metadata.

These shortcomings are currently being addressed through several strategies. First, the CCN is being integrated into the workflows and tools of large cell-typing consortia such as the BRAIN Initiative - Cell Census Network (BICCN). For example, a series of linked multimodal studies in primary motor cortex (summarized in the "flagship paper") all apply the CCN. The Human BioMolecular Atlas Program (HuBMAP) has incorporated the nomenclature into its ontology explorer, and efforts are underway to ensure compatibility of the CCN with the cell annotation platform developed by the Human Cell Atlas (HCA), whose goal is to build an atlas of all cells in the human body, and which already has many of the required mechanisms in place for governance of this cell type classification workflow. Second, the CCN has been submitted to INCF for consideration for inclusion in its Standards and Best Practices portfolio. Third, cell set aligned alias terms were chosen to match ontologies such as the Cell Ontology (CL) when possible, and efforts are underway to extend the CL as needed. Fourth, a cell type standard governing body would ideally be responsible for vetting ontologies for organizing data and controlled vocabularies for assigning cell type nomenclature, and would need to define a submission process that ensures critical data and metadata can be stored in a robust database. Cross-institute working groups from BRAIN Initiative-funded initiatives have begun organizing such a governance framework. Finally, the CCN has been shared with the scientific and general community in many forums. While great progress has been made, the CCN is still far from a true community standard, and any suggestions on this topic are strongly encouraged at the Allen Brain Map Community Forum.

Future development and gathering community input

This work presents a framework that is a modest step in a long and iterative process involving a wide community, and which will be refined in the months and years to come through cross-disciplinary partnership. Incorporating community input on the definition and management of cell type standards will be necessary as new experiments are performed and additional evidence emerges. Efforts to extend this framework to include directional cell types and continuous sources of variation will be needed to properly capture the landscape of development, aging, and disease. The Allen Brain Map Community Forum is a venue available for discussion related to cell taxonomies as a starting point for exchanging ideas and input from the community in an open way. Please provide feedback on this nomenclature schema and share examples of this schema applied to your own data sets.

References

Boldog, E., Bakken, T., Hodge, R., et al. Transcriptomic and morphophysiological evidence for a specialized human cortical GABAergic cell type. Nat Neurosci 21, 1185–1195 (2018). https://doi.org/10.1038/s41593-018-0205-2
BRAIN Initiative Cell Census Network (BICCN). A multimodal cell census and atlas of the mammalian primary motor cortex. bioRxiv (2020). https://doi.org/10.1101/2020.10.19.343129
Bruford, E.A., Braschi, B., Denny, P., et al. Guidelines for human gene nomenclature. Nat Genet 52, 754–758 (2020). https://doi.org/10.1038/s41588-020-0669-3
Hodge, R., Bakken, T., et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019). https://doi.org/10.1038/s41586-019-1506-7
Miller, J., et al. Common cell type nomenclature for the mammalian brain. eLife 9, e59928 (2020). https://doi.org/10.7554/eLife.59928
Tasic, B., et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018). https://doi.org/10.1038/s41586-018-0654-5
Tasic, B., Menon, V., et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci 19, 335–346 (2016). https://doi.org/10.1038/nn.4216

Glossary of terms

Term	Definition	Example
Taxonomy	Set of quantitatively-derived data clusters defined by a specific computational algorithm on a specific dataset(s). Taxonomies are given a unique label and can be annotated with metadata about the taxonomy, including details of the algorithms, and relevant cell and cell set IDs.	Any clustering result in a cell type classification manuscript
Taxonomy ID	An identifier uniquely tagging a taxonomy of the format CCN[YYYYMMDD][#].	CCN201910120
Dataset	Feature information (e.g., gene expression) and associated metadata from a set of cells collected as part of a single project.	Gene expression from 6,000 human MOp nuclei
Cell	A single entry in a taxonomy representing data from a single cell (or cell compartment, such as the nucleus). Cells have meta-data including a unique ID.	'AAACCCAAGGATTTCC-LKTX_190129_01_A01' from human MOp
Cell set	Any tagged group of cells in a taxonomy. This includes cell types, groups of cell types, and potentially other informative groupings (e.g., all cells from a particular donor, organ, cortical layer, or transgenic line). Cell sets have a number of IDs and descriptors (as discussed below) and can also have other meta-data.	A cell type, A group of cell types, All cells from layer 2 in MTG, All cells from donor X
Provisional cell type	Quantitatively derived data cluster defined within a taxonomy. This is a specific example of a cell set that is of high importance, as most other cell sets are groupings of one or more provisional cell types.	A cell type defined in a specific manuscript; A leaf node in a clustering tree
Cell set accession ID	A unique ID across all tracked datasets and taxonomies. This tag labels the taxonomy and numbers each cell set. Format: CS[taxonomy id]_[unique # within taxonomy].	CS201910120_1
Cell set label	An ID unique within a single taxonomy that is used for assigning cells to cell sets defined as a combination of multiple “provisional cell types”.	MTG 12, MTG 1-8
Cell set alias	Any cell set descriptor. It can be defined computationally based on the data, or manually based on prior knowledge or new experiments, or a combination of both. Cell set aliases beyond the “preferred” or “aligned” are defined as “cell set additional aliases”.	(any “cell set aligned alias”), Interneuron 1, Rosehip
Cell set preferred alias	The primary cell set alias (e.g., what cell types are called in a paper). This can sometimes match the aligned alias, but not always, and can be left unassigned.	Inh L1-2 PAX6 CDH12, ADARB2 (CGE), Chandelier, [blank]
Cell set aligned alias	Analogous to “gene symbol”. At most one biologically-driven name for linking matching cell sets across taxonomies and with a reference taxonomy.	L2/3 IT 4, Pvalb 3, Microglia 2
Cell set structure	The location in the brain (or body) from where cells in the associated set were primarily collected.	Neocortex
Cell set ontology tag	A tag from a standard ontology (e.g., UBERON) corresponding to the listed cell set structure.	UBERON: 0001950
Cell set alias assignee	Person responsible for assigning a specific cell set alias in a specific taxonomy (e.g., the person who built the taxonomy or uploaded the data, or a field expert).	(First author of manuscript)
Cell set alias citation	The citation or permanent data identifier corresponding to the taxonomy where the cell set was originally reported.	(DOI of manuscript), [blank]
Reference taxonomy	A taxonomy based on one or a combination of high-confidence datasets, to be used as a baseline of comparison for datasets collected from the same organ system. This is not explicitly part of the CCN.	Cross-species cortical cell type classification
Metadata	Additional pieces of information about taxonomies, data sets, cells, or cell sets.	Donor, species, analysis method, RNA-seq platform
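
The ID formats listed in the glossary lend themselves to simple validation. The sketch below parses a taxonomy ID of the form CCN[YYYYMMDD][#] and a cell set accession ID of the form CS[taxonomy id]_[#]. The regular expressions and function names are assumptions made for illustration, including the choice to treat [#] as one or more digits; they are not part of the CCN specification or its code.

```python
import re
from datetime import datetime

# Assumed reading of the glossary formats: an 8-digit date, then a counter.
TAXONOMY_ID = re.compile(r"^CCN(\d{8})(\d+)$")
CELL_SET_ID = re.compile(r"^CS(\d{8})(\d+)_(\d+)$")


def parse_taxonomy_id(taxonomy_id):
    """Return (date, counter) for an ID such as 'CCN201910120'."""
    match = TAXONOMY_ID.match(taxonomy_id)
    if not match:
        raise ValueError(f"not a taxonomy ID: {taxonomy_id!r}")
    # strptime also rejects impossible dates such as month 13.
    date = datetime.strptime(match.group(1), "%Y%m%d").date()
    return date, int(match.group(2))


def parse_cell_set_id(accession_id):
    """Return (taxonomy ID, number within taxonomy) for 'CS201910120_1'."""
    match = CELL_SET_ID.match(accession_id)
    if not match:
        raise ValueError(f"not a cell set accession ID: {accession_id!r}")
    taxonomy_id = "CCN" + match.group(1) + match.group(2)
    parse_taxonomy_id(taxonomy_id)  # validate the embedded date
    return taxonomy_id, int(match.group(3))


print(parse_taxonomy_id("CCN201910120"))
print(parse_cell_set_id("CS201910120_1"))
```

Because the accession ID embeds the taxonomy ID, a cell set can be traced back to its taxonomy from the identifier alone, which is what makes the ID unique across all tracked taxonomies.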