
Cell Type Nomenclature

Overview

Devising a naming convention that is both colloquial and intuitive is a challenge for any ontology or taxonomy. Yet an accurate and flexible nomenclature is necessary to describe and analyze existing and upcoming cell-level data. The Cell Type Ontology Workshop (held at the Allen Institute in Seattle, June 17-18, 2019) convened experts in the fields of ontology, taxonomy and neuroscience to make recommendations, highlight best practices and propose conventions for naming cell types.

As an outcome of this workshop, a framework for developing nomenclatures was constructed.  A primary objective was to develop a system that would be straightforward and allow designation of cell types with or without hierarchical organization.  This nomenclature is initially applied to brain cells, and is intended to encompass existing naming strategies used in publications across diverse research teams. A broader goal is to allow tracking of many different taxonomies, including those from different organ systems or across diverse areas of bioscience. 

Here we describe the working framework for creating brain cell type nomenclature, and include examples using published datasets.  We introduce the concept of a "cell set" with properties that, in principle, can be tracked in a manner qualitatively similar to how transcripts are tracked across different versions of GENCODE genome builds.  We use the term "provisional cell type" to represent quantitatively-derived data clusters defined by running a specific computational algorithm on a specific data set (here called a "taxonomy").  Stated differently, a taxonomy is a system for tracking structured tertiary analysis of data.  In this nomenclature, provisional cell types are cell sets, whereas cell sets can be any group of cells defined within a specific taxonomy, and both can be treated identically for future ontology building.  Finally, we seek input from the community on all aspects of this process, through the Allen Brain Map Community Forum.

Naming Cell Types

 

Classification requires disambiguation;  for the purposes of use here, we must use broad terms that may be open to multiple interpretations.  A glossary of terms is summarized below, including examples of how the terms are used. Cell types and cell sets both are assigned multiple identifier tags, which are used for different purposes.  Cell set accession IDs can track unique cell sets across the entire universe of taxonomies.  Cell set aliases can provide both data-driven and common-usage cell type names for public consumption (e.g., in manuscripts).  Cell set labels provide general tags that allow easy tracking of groups of cell types within a single taxonomy that could span many organ systems.  More details about these descriptors and an example based on a recently published taxonomy are presented below.  Note that we do not define a "cell type name" since this is a charged term that has different meanings for different people.

 

Glossary of terms

Descriptor

Brief description

Example

Details

Cell type accession ID

An ID uniquely identifying a cell type across all possible data sets and taxonomies.

CS1910120001

The format is <CS><YYMMDD><T><SSS> concatenated, where:

  • YYMMDD represents a 6 digit date format (Y=year, M=month, D=day)

  • T is a 1-digit taxonomy counter, which allows up to 10 on the same date

  • SSS is a 3-digit cell set counter, which allows up to 1000 cell sets for the same taxonomy

Together, the prefix, date, and taxonomy counter match the accession ID of a specific taxonomy (described below).

Cell type alias

Cell type names for use in publications; based on gene expression and other data-driven features.

Key point: any cell type name in a manuscript can be considered the cell type alias with no modifications needed.

Inh L1-2 PAX6 CDH12

Sst Chodl_1

L2/3 IT Cxcl14_3

Cell type aliases have no prescribed format, which gives different groups the flexibility to define cell types however they like (although aliases should be based on available data).  Currently the mouse and human taxonomies use different conventions for aliases.

  • For mouse cortex + hippocampus the format is such that the subclass contains some combination of information about canonical gene expression, primary layer or brain region localization, projection targets, and broad cell class (e.g., astrocyte, VLMC).

  • For human cortex, the format is described in this recent study in Nature.1

Cell type alternative alias(es)

Additional aliases for a cell type.  These could include mapping to common-usage names, when possible.

Chandelier 1

Rosehip

Sst Chodl

Cell types can have multiple aliases, if additional information is known about the cell type.  Currently only a small subset of cell types have alternative aliases, and only in cases where a specific transcriptomic cell type is known to link with either a common-usage, anatomically-defined cell type (e.g., Chandelier) or a specific transcriptomic type from a different taxonomy (e.g., Sst Chodl).  In later releases, a "primary alias" will be defined, which will allow tracking the same cell type across multiple taxonomies defined using data from multiple modalities.  Similarly, this is the slot for mapping to "common usage types" in the Neuron Phenotype Ontology (NPO).

Cell type label

A short name with a header that is part of a dictionary of broad cell types (e.g., Neuron) followed by a number.

Neuron 13

In this case, the header is either Neuron or Non-neuron, followed by a counter that matches the order in which a given cell type occurs in the dendrogram for the relevant taxonomy.  These broad names are useful for several reasons.  First, they allow our nomenclature scheme to be extensible to many organ systems.  Second, they allow cell sets to easily refer to specific cell types (as described below).  Third, they provide stable cell type names while aliases appropriate to each brain (or other tissue) region are refined.
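The accession ID format in the glossary is regular enough to check mechanically. The following is a minimal Python sketch, not the code in the `allen_institute_nomenclature` repository; the names `AccessionID`, `parse_accession_id`, and `format_accession_id` are hypothetical and introduced here only for illustration. It assumes the ID is exactly the prefix CS followed by a 6-digit date, a 1-digit taxonomy counter, and a 3-digit cell set counter.

```python
import re
from datetime import datetime
from typing import NamedTuple


class AccessionID(NamedTuple):
    """Fields of a <CS><YYMMDD><T><SSS> accession ID (hypothetical helper type)."""
    date: str       # YYMMDD
    taxonomy: int   # T, 0-9
    cell_set: int   # SSS, 0-999


_PATTERN = re.compile(r"CS(\d{6})(\d)(\d{3})")


def parse_accession_id(accession: str) -> AccessionID:
    """Split an accession ID into its fields, raising ValueError if malformed."""
    match = _PATTERN.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a <CS><YYMMDD><T><SSS> accession ID: {accession!r}")
    date, taxonomy, cell_set = match.groups()
    datetime.strptime(date, "%y%m%d")  # raises ValueError for impossible dates
    return AccessionID(date, int(taxonomy), int(cell_set))


def format_accession_id(date: str, taxonomy: int, cell_set: int) -> str:
    """Concatenate the fields back into an accession ID."""
    if not 0 <= taxonomy <= 9:
        raise ValueError("taxonomy counter must fit in one digit")
    if not 0 <= cell_set <= 999:
        raise ValueError("cell set counter must fit in three digits")
    return f"CS{date}{taxonomy}{cell_set:03d}"


parsed = parse_accession_id("CS1910120001")
```

Parsing and formatting are inverses here, so an ID can be split, inspected, and rebuilt without loss.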

Example:  Human brain - middle temporal gyrus & cell type aliases

The upper right of this diagram shows an annotated dendrogram, or hierarchical organization of cell types defined for a specific taxonomy, in human middle temporal gyrus, as described in this recent study in Nature1.  Below the dendrogram, the names of cell types that were presented in this publication are given.  On the left is an inset with the provisional cell type alias, label, and accession IDs for a subset of cell types, as they would be defined in the current nomenclature schema.  Although not shown, Inh L1-4 LAMP5 LCP2 corresponds to Rosehip cells (described in this recent study in Nature Neuroscience2), and therefore a cell type alternative alias for this type would be "Rosehip".  Other nodes highlighted in blue might also have colloquial aliases.

 

Naming Cell Sets

Cell sets can represent any collection of cells within a specific taxonomy (including cell types), and are defined using the same terms as above.  Currently, the only cell sets we define (other than provisional cell types) are internal nodes of a clustering tree.  More details about these descriptors in this context and an example based on a recently published taxonomy are presented below.

Glossary of terms

Descriptor

Brief description

Example

Details

Cell set accession ID

An ID uniquely identifying a cell set across all possible data sets or taxonomies.

CS1910120231

The format is <CS><YYMMDD><T><SSS>, as defined above.  Cell sets will have accession IDs distinct from those defined for cell types from the same taxonomy.  Currently, cell sets only refer to internal nodes of the clustering dendrogram; however, in principle they could be defined for any set of cells.

Cell set alias

Either [blank] or a flexible descriptor that approximates the features of the included cell types.

ADARB2 (CGE)

FEZF2

Interneuron

As with cell types, cell sets can have multiple aliases that can describe different aspects of the data.  Currently, most cell sets that refer to internal nodes will not have aliases.  A subset of nodes will contain some combination of information about canonical gene expression, primary layer or brain region localization, projection targets, and broad cell class (e.g., neuron, astrocyte, VLMC).  In later iterations, a "primary alias" will be defined, which will allow tracking the same cell set across multiple taxonomies defined using data from multiple modalities, regardless of whether these sets are provisional cell types or are higher in the hierarchical tree.

Cell set label

A concatenation of the labels of the included cell types.

Neuron 1-3

Non-neuron 1-9

Cell set labels succinctly define which cell types are included in a given cell set.  This value is meaningful for any cell set that includes all cells from one or more cell types, whether or not these cell types are organized in a hierarchical structure.  Refinements will be necessary for arbitrarily-defined cell sets.
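As an illustration of how a cell set label can be derived from cell type labels, here is a short Python sketch. The function name `cell_set_label` is hypothetical. Collapsing consecutive counters into a range reproduces the examples above (Neuron 1-3, Non-neuron 1-9); the handling of non-consecutive counters and of mixed headers is an assumption made for this sketch, since the schema does not yet specify those cases.

```python
from collections import defaultdict
from typing import Iterable


def cell_set_label(cell_type_labels: Iterable[str]) -> str:
    """Concatenate cell type labels such as 'Neuron 2' into a cell set label."""
    counters = defaultdict(list)
    for label in cell_type_labels:
        # The header is everything before the last space; the counter follows it.
        header, _, counter = label.rpartition(" ")
        counters[header].append(int(counter))

    parts = []
    for header, values in counters.items():
        values = sorted(set(values))
        ranges = []
        start = prev = values[0]
        for value in values[1:]:
            if value == prev + 1:
                prev = value
                continue
            ranges.append((start, prev))
            start = prev = value
        ranges.append((start, prev))
        # Assumption: gaps are written as comma-separated ranges, e.g. '1-2, 5'.
        text = ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
        parts.append(f"{header} {text}")
    # Assumption: sets spanning several headers are joined with '; '.
    return "; ".join(parts)


label = cell_set_label(["Neuron 1", "Neuron 2", "Neuron 3"])
```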

Example:  Human brain - middle temporal gyrus & cell sets

This diagram shows example descriptors for three cell sets, shown on the same dendrogram presented earlier (human brain - middle temporal gyrus & cell type aliases), with example nodes in blue.  In this case, only one of the three nodes has a cell set alias, which refers to expression of specific canonical genes.  In all three cases, the cell set label indicates the cell types included within that node in the tree.

 

Tracking taxonomies

Here we define a taxonomy as a particular version of a particular computational algorithm applied to a particular set of data.  While our example data set is hierarchically structured, and taxonomies are typically associated with hierarchically structured data, this schema can be applied generally to any cell sets resulting from any tertiary analysis.  Since it is likely that many different taxonomies will be generated that include overlapping sets of cells and similar but modified clustering strategies, it is critical to track them.  Such tracking will allow for reproducibility of results, among other advantages.  Below is a current tracking schema for taxonomies, subject to change.

Descriptor

Example value

Description

taxonomy accession ID

CS1910121

This value matches the cell set/type accession IDs above and has the format <CS><YYMMDD><T>.

short_name

AIT3.0_human

This name contains only critical information about the taxonomy, including institute (AI = Allen Institute), modality (T = transcriptomics), version (3.0), and species (human).

long_name

2019_human_Cortex_t120_SSv4

This name includes additional information and is of the format <year>_<species>_<brain region>_<modality><# clusters>_<experimental platform> (SSv4 = SMART-Seq V4).

subset

Cortex

Which region was included in the taxonomy

Dendrogram conf.th.

40

A confidence threshold for the clustering dendrogram.

Date built

20190301

Date the taxonomy was created

Species

Homo sapiens

Species (Latin name)

#Types

120

Number of cell types

Code version

>1.4

Version of the code

De.param

q1.th=0.5, q.diff.th=0.7, de.score.th=20, min.cells=4, "directional" 

Various parameters used for clustering in the scrattch.hicat R library, as published in this mouse scRNA-Seq study of V1 and ALM3.
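The naming rules in this table can be expressed as a small record type. The sketch below is illustrative only: the class `Taxonomy` and its field names are hypothetical, and it assumes that the modality letter is upper case in the short name and lower case in the long name, as in the two example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxonomy:
    """A subset of the tracking fields for one taxonomy (hypothetical layout)."""
    accession_id: str  # <CS><YYMMDD><T>
    institute: str     # e.g. 'AI' for Allen Institute
    modality: str      # e.g. 'T' for transcriptomics
    version: str
    species: str
    year: int
    region: str
    n_types: int
    platform: str      # e.g. 'SSv4' for SMART-Seq V4

    @property
    def short_name(self) -> str:
        # Institute, modality, and version, then the species.
        return f"{self.institute}{self.modality.upper()}{self.version}_{self.species}"

    @property
    def long_name(self) -> str:
        # <year>_<species>_<brain region>_<modality><# clusters>_<experimental platform>
        return (f"{self.year}_{self.species}_{self.region}_"
                f"{self.modality.lower()}{self.n_types}_{self.platform}")

    def cell_set_accession_id(self, counter: int) -> str:
        """Append a 3-digit cell set counter to the taxonomy accession ID."""
        if not 0 <= counter <= 999:
            raise ValueError("cell set counter must fit in three digits")
        return f"{self.accession_id}{counter:03d}"


taxonomy = Taxonomy("CS1910121", "AI", "T", "3.0", "human", 2019, "Cortex", 120, "SSv4")
```

Deriving both names from the same fields keeps them consistent with each other whenever a taxonomy is rebuilt.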

Usage:  Application of nomenclature conventions to novel or published data sets

Any nomenclature schema is only useful if it is adopted by many scientists studying cell types across multiple institutes and organ systems.  Therefore, we have aimed to make this schema easy to understand, lightweight, and easy to apply to novel or published data sets, and have provided both tools and examples showing how this nomenclature can be applied to any cell-level data.  Code for applying this nomenclature is available in the `allen_institute_nomenclature` GitHub repository.  An example of how this schema can be applied to an existing data set is shown below.

 

In a study of mouse primary visual cortex using ~1700 cells4, a total of 42 neuronal and 7 non-neuronal cell types were proposed, and cell type-specific mRNA processing and genetic access to these transcriptomic types was accomplished through the use of many transgenic Cre lines.  This work was published in Nature Neuroscience in 2016, and the data can be browsed with an interactive navigation application.  Above, we have taken the hierarchical organization of cell types and cell type aliases as described and have applied the nomenclature schema to append cell type labels and accession IDs for each cell type (right column).  We have also annotated a subset of the internal nodes with a cell set alias, label, and accession ID, as described.  Note that by default, node aliases are left blank (e.g., nodes C and F), but nodes representing a useful collection of cell types can be manually tagged with an alias (e.g., node "B" contains all Pvalb types and no others).  Details and code for reproducing this analysis are in the `allen_institute_nomenclature` GitHub repository.
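The annotation described here amounts to a traversal of the dendrogram. The following Python sketch shows one way to do it and is not the code in the repository: `Node`, `leaves`, `internal_nodes`, and `annotate` are hypothetical names, the alias strings are placeholders, and a single header is used for brevity. It also assumes that cell types are numbered first in dendrogram order and that internal nodes then receive the next counters in pre-order, which keeps cell set IDs distinct from cell type IDs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    alias: str = ""  # blank by default, as for most internal nodes
    children: List["Node"] = field(default_factory=list)
    label: str = ""
    accession_id: str = ""


def leaves(node: Node) -> List[Node]:
    """Return the cell types under a node, in dendrogram (left-to-right) order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def internal_nodes(node: Node) -> List[Node]:
    """Return the internal nodes under (and including) a node, in pre-order."""
    if not node.children:
        return []
    found = [node]
    for child in node.children:
        found.extend(internal_nodes(child))
    return found


def annotate(root: Node, taxonomy_id: str, header: str = "Neuron") -> None:
    """Fill in labels and accession IDs for every node of the tree in place."""
    tips = leaves(root)
    position = {id(leaf): i for i, leaf in enumerate(tips, start=1)}
    for leaf in tips:
        n = position[id(leaf)]
        leaf.label = f"{header} {n}"
        leaf.accession_id = f"{taxonomy_id}{n:03d}"
    # Internal nodes continue the counter after the last cell type.
    for counter, node in enumerate(internal_nodes(root), start=len(tips) + 1):
        covered = [position[id(leaf)] for leaf in leaves(node)]
        node.label = f"{header} {covered[0]}-{covered[-1]}"
        node.accession_id = f"{taxonomy_id}{counter:03d}"


# Placeholder tree: one manually aliased node with two types, plus one more type.
pvalb = Node(alias="Pvalb", children=[Node("Pvalb type 1"), Node("Pvalb type 2")])
root = Node(children=[pvalb, Node("Sst type 1")])
annotate(root, "CS1910121")
```

Because the leaves under any internal node are contiguous in dendrogram order, each cell set label reduces to a single range.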

 

Nomenclature and Reference Data Sets

There is compelling evidence for the existence of distinct cell types based on robust groupings of cells by observable and measurable cell attributes. Given the evidence of generally close correlation between molecular, physiological, and morphological cell characteristics noted above, the field is moving toward the notion of defining a reference cell type classification system based on the clustering of single cell/nucleus transcriptomic (scRNA-Seq) data as an initial framework, and then layering on additional phenotypic data as they become available, especially from multimodal (e.g., Patch-Seq type) assays.  Currently, many groups are performing scRNA-Seq analysis in different areas of the brain (or body), from multiple mammalian species, and across trajectories of development, aging, and disease, meaning that the definition of a reference (or consensus) cell type is likely to be a moving target and that tracking of cell type definitions using appropriate ontology, data structure, and nomenclature is critical.  With this in mind, it will be important to associate specific experiment and analysis parameters (experimental design, sample source, scRNA-Seq platform used, data processing methods and clustering or other algorithms used, etc.) and clustering results with the latest reference cell types to build a comprehensive knowledge base.  Although the primary focus here is nomenclature, we highlight the general need for reference data sets, as they are a critical consideration for usable taxonomies.  We expect to apply an organizational framework including reference classification to cell data planned for release in 2020.

Defining and refining reference cell types

A general workflow is presented below, to assist with defining and refining reference cell types in a generalizable way.  This workflow accommodates methodological differences in cell type definitions that will likely vary between labs and change as new methods are developed.  Note that this workflow makes some major assumptions about cell type ontology, data visualization, and governance that will be discussed in the next section.

This workflow can be broken down into four broad stages:

  1. First, many research teams will independently define cell types, identify their discriminating features, and name them using one of many available experimental and computational strategies.  The example workflow reflects the current state of the field.  We propose a formalization to the nomenclature, which can (and should!) be applied to each data set independently.

  2. Second, an initial reference cell type classification will be defined by taking the results from one or more (ideally validated) data sets and integrating these data together in a single analysis.  Transcriptomics is well suited to allow quantitative comparisons across species, across developmental time, and between brain and other organs, and our goal is for reference cell types to be defined using this modality. For example, scAlign and Seurat have been used to align data from human and mouse cortex into a single consensus cell type classification (see below).  Once reference cell types are defined, features discriminating cell types should be calculated separately in each data set so that canonical discriminating features can be separated from ones represented in only a subset of data sets.  These features, along with additional meta-data available for these cell types from any of the integrated data sets will be used to match cell types to existing ontologies where possible, and to update ontology terms as needed.  Finally, reference cell types will be named as described in this nomenclature convention, with some additional constraints discussed below.  Note: while we present a reference classification here as an integration of multiple data sets, it is perfectly valid to have a single data set stand alone as a reference data set.

  3. Third, this reference cell type classification will now be used as a comparator for any related data sets, providing a mechanism for transferring prior knowledge about cell types across data sets.  Existing data can be renamed by mapping cell sets onto the reference classification and then updating the cell set primary alias to match the reference.  Similarly, new data sets can either be clustered independently and mapped to the reference or directly integrated with the reference.  In either case the resulting nomenclature will be informed by the reference data set.

  4. Finally, new versions of the reference cell type classification will be periodically generated using additional data and/or computational methods, and this new classification will now be used as a comparator for related data sets.  Steps 3 and 4 can iterate at some to-be-defined cadence.
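The renaming step in stage 3 can be sketched in Python as follows. This is a hypothetical illustration: `CellSet` and `rename_from_reference` are made-up names, all IDs and aliases in the example are placeholders, and the mapping from cell sets to reference cell types is assumed to have been produced beforehand by whatever mapping method a group prefers. Only the primary alias changes; accession IDs, labels, and existing aliases are left untouched.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CellSet:
    accession_id: str
    label: str
    alias: str
    primary_alias: Optional[str] = None


def rename_from_reference(
    cell_sets: List[CellSet],
    mapping: Dict[str, str],            # cell set accession ID -> reference accession ID
    reference_aliases: Dict[str, str],  # reference accession ID -> primary alias
) -> List[CellSet]:
    """Return copies of the cell sets with primary aliases taken from the reference."""
    renamed = []
    for cell_set in cell_sets:
        reference_id = mapping.get(cell_set.accession_id)
        if reference_id is None:
            # Unmapped cell sets are kept as they are.
            renamed.append(cell_set)
        else:
            renamed.append(replace(cell_set, primary_alias=reference_aliases[reference_id]))
    return renamed


# Placeholder data, for illustration only.
existing = [
    CellSet("CS2001011001", "Neuron 1", "type A"),
    CellSet("CS2001011002", "Neuron 2", "type B"),
]
updated = rename_from_reference(
    existing,
    mapping={"CS2001011001": "CS2002021007"},
    reference_aliases={"CS2002021007": "reference type X"},
)
```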

 

Cell type ontology curation, dissemination and governance

For the workflow above to be useful and adopted, additional progress is required on several fronts.  First and most importantly, a governing body that will be respected by an international scientific user community needs to assemble and take on several key tasks.  This governing body will decide which data sets to include in a reference, find a place and standard format to store the reference, provide tools for visualization of the reference data alone and mapping of data onto the reference, and provide a framework for annotating cell types, among other things.  This group will also be responsible for vetting a standard ontology for organizing data, along with a controlled vocabulary for assigning cell type nomenclature.  Among others, potential entities that may provide guidance for this governing body include the BRAIN Initiative - Cell Census Network (BICCN), whose charter is to provide researchers and the public with a comprehensive reference of the diverse cell types in human, mouse, and marmoset brain, and the Human Cell Atlas (HCA), whose goal is to build an atlas of all cells in the human body, and which already has many of the required mechanisms in place for governance of this cell type classification workflow.  The Neuroscience Information Framework (NIF) and INCF are other communities of experts to look toward.

Reference cell type nomenclature

In summary, we propose a formalization to cell type nomenclature that can be applied to (ideally) all cell type classifications.  The term "provisional cell type" is used to represent quantitatively-derived data clusters defined by running a specific computational algorithm on a specific data set (here called a "taxonomy").  All cell sets (including provisional cell types) are assigned multiple identifier tags, which are used for different purposes.  Cell set accession IDs can track unique cell sets across the entire universe of taxonomies.  Cell set aliases can provide both data-driven and common-usage cell type names for public consumption (e.g., in manuscripts).  Cell set labels provide general tags that allow easy tracking of groups of cell types within a single taxonomy that could span many organ systems.  These terms apply to reference cell type classifications as well.  In addition, we define a reference cell type classification as a taxonomy applied to the results of a reference cell type analysis, and expand the scope of a primary alias.

 

In a reference cell type classification, the primary alias is designed to allow tracking of the same cell type across multiple taxonomies defined using data from multiple modalities.  For this reason, it should match (directly map to) cell types defined in the relevant ontology (i.e., common usage types in the Neuron Phenotype Ontology).  Either this alias or other cell set aliases should include prior knowledge (provenance), canonical discriminating genes, and/or information from other modalities (such as electrophysiological properties, if available) to provide the best data to match cell types in the reference with cell types defined previously, or in future taxonomies from any modality.

 

Example:  Human & mouse cortex cell type nomenclature

Above we present cell types defined in human middle temporal gyrus using snRNA-Seq (A; Hodge, et al.1) and in mouse primary visual cortex (VISp) and ALM using scRNA-Seq (B; Tasic, et al.3).  These studies identified ~75-100 types per cortical area per species from ~15,000 cells, each of which could be defined using one (or a combination of) robust marker genes.  In addition, we characterized these cell types by associating relevant meta-data from the assigned cells, including things like cortical layer of dissection, brain region, alignment statistics, and (in mouse) projection targets of a subset of cells.  We then performed data integration on these two data sets using scAlign and including only cells from V1 in mouse, and arrived at ~40 reference cell types (C).  The resulting data set included some one-to-one matches (e.g., a single mouse cell type matches a single human cell type), with many of the remaining reference types matching to internal nodes of the tree (e.g., cell sets that are not defined as putative cell types).  From these data, primary aliases were assigned using a combination of (i) robust gene markers from the literature, (ii) highly discriminating gene markers in these data, (iii) projection targets in mouse, (iv) historical names based on cell shape, and (v) broad cell class names (that directly map to ontologies).  In principle, this homologous cell type classification could be used as a cross-species reference cell type classification; however, now that data on additional cortical areas are available using multiple RNA sequencing platforms, it seems prudent to await data integration that will provide a more comprehensive reference.

 

Future development and gathering community input

This is a small step in a long and iterative process involving a wide community, and with cross-disciplinary partnership, we anticipate refinement in the months and years to come. The Allen Brain Map Community Forum is a venue available for discussion related to cell taxonomies as a starting point for exchanging ideas and input from the community in an open way. This forum is open to all and we encourage your participation, both by providing feedback on our nomenclature schema and by providing examples of this schema applied to your own data sets.

References

  1. Hodge, R., Bakken, T., et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019). https://doi.org/10.1038/s41586-019-1506-7

  2. Boldog, E., Bakken, T., Hodge, R., et al. Transcriptomic and morphophysiological evidence for a specialized human cortical GABAergic cell type. Nature Neuroscience 21, 1185–1195 (2018). https://doi.org/10.1038/s41593-018-0205-2

  3. Tasic, B., et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018). https://doi.org/10.1038/s41586-018-0654-5

  4. Tasic, B., Menon, V., et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature Neuroscience 19, 335–346 (2016). https://doi.org/10.1038/nn.4216