To investigate the cellular diversity across human cortex, a low-bias approach to profile cell-type diversity was sought, constrained by the challenge of working with precious and limited tissue sources. Individual layers of cortex were dissected from tissues covering the middle temporal gyrus (MTG), anterior cingulate gyrus (CgGr), primary visual cortex (V1C), primary motor cortex (M1C), primary somatosensory cortex (S1C) and primary auditory cortex (A1C) derived from human brain, and nuclei were dissociated and sorted using the neuronal marker NeuN. In total, nuclei from MTG were sampled from four postmortem donor brains and four neurosurgical donor brains and nuclei from the remaining regions were sampled from 3 postmortem donor brains. Profiled nuclei were derived from approximately 90% neurons and 10% glia, and layer sampling was based on the relative number of neurons in each layer. Additional nuclei were sampled from deep layers of cortex to reflect the transcriptomic diversity of neurons in those layers. More than 49,000 nuclei were sampled across the six cortical regions, resulting in an estimated detection of neuronal cell types as rare as 0.2% of all neurons. Data from matched brain regions in mouse also allows for a direct comparison of cell types between species.
Additional RNA-Seq data is available as part of the Brain Cell Data Center (BCDC) portal.
This database of cell types includes experimental data derived from adult human brain. Human brain tissue from both postmortem and neurosurgical origin was made available through the generosity of tissue donors. Clinical summaries and donor characteristics are provided in this document as well as a description of the criteria for acceptance of use in this study.
To prepare and archive tissues from suitable cases, whole postmortem brain specimens were bisected through the midline, and individual hemispheres were embedded in alginate for slabbing. Coronal brain slabs were cut at 0.5-1cm intervals through each hemisphere and the slabs were then frozen in a bath of dry ice and isopentane, vacuum sealed in freezer bags to prevent frost damage, and stored at -80°C until use. Regions of interest were subsequently removed from tissue slabs, sectioned on a vibratome and processed for nuclei isolation.
Neurosurgical donor tissue (MTG only) was received from patients undergoing surgery for epilepsy or brain tumors. The tissue blocks received were distal, apparently normal cortical tissue removed to access underlying pathological brain tissues. Tissue was transported in chilled ACSF, sectioned at 350µm and stored at -80°C until they were processed for nuclei isolation.
Single nuclei were captured by gating on DAPI-positive events, excluding debris and doublets, and then gating on NeuN signal, which allowed for the isolation of either NeuN-positive (neuronal) or NeuN-negative (non-neuronal) events.
SMART-Seq v4 Ultra Low Input RNA Kit for Sequencing (Takara #634894) was used per the manufacturer’s instructions for cDNA synthesis of single-cell RNA and subsequent amplification. Sequencing libraries were prepared using the NexteraXT DNA Library Preparation kit (Illumina FC-131-1096) with NexteraXT Index Kit V2 Set A, B, C, or D (FC-131-2001, 2002, 2003, or 2004) or custom 8-base or 10-base Unique Design index primers designed and manufactured by IDT (Integrated DNA Technologies). NexteraXT DNA Library prep was done at either 0.5x volume manually or 0.4x or 0.2x volume on the Mantis instrument (Formulatrix). Pooled sequencing libraries were sent to an outside vendor for sequencing on an Illumina HiSeq 2500 instrument. All of the library pools were run using Illumina High Output V4 chemistry. RNA sequencing services were provided by Covance Genomics Laboratory, Seattle subsidiary of LabCorp Group of Holdings, and The Broad Institute Genome Sequencing Platform.
SMART-Seq v4 (1x) amplification
SMART-Seq v4 (0.5x) amplification
Raw read (fastq) files were aligned to the GRCh38 human genome sequence (Genome Reference Consortium, 2011) with the RefSeq transcriptome version GRCh38.p2 (current as of 4/13/2015) and updated by removing duplicate Entrez gene entries from the gtf reference file for STAR processing. For alignment, Illumina sequencing adapters were clipped from the reads using the fastqMCF program. After clipping, the paired-end reads were mapped using Spliced Transcripts Alignment to a Reference (STAR) using default settings. STAR uses and builds it own suffix array index, which considerably accelerates the alignment step while improving on sensitivity and specificity due to its identification of alternative splice junctions. Reads that did not map to the genome were then aligned to synthetic construct (i.e. ERCC) sequences and the E.coli genome (version ASM584v2). The final results files included quantification of the mapped reads (raw exon and intron counts for the transcriptome-mapped reads). This quantification only includes uniquely mappable sequences, which makes up the vast majority of reads. A median of 88.4% of reads are uniquely mappable (range: 45.4-93.7%) compared with only 3.2% that are multi-mapping (range 1.6-10.1%), suggesting that any bias related to exclusion of multi-mappers would be relatively minor. Also, part of the final results files are the percentages of reads mapped to the RefSeq transcriptome, to ERCC spike-in controls, and to E. coli, and summaries of these percentages are saved for quality control assessments. Quantification was performed using summerizeOverlaps from the R package GenomicAlignments.
Expression levels were calculated as counts per million (CPM) of exonic plus intronic reads, and log2(CPM + 1) transformed values were used for a subset of analyses as described below. Gene detection was calculated as the number of genes expressed in each sample with CPM > 0. CPM values reflected absolute transcript number and gene length, i.e. short and abundant transcripts may have the same apparent expression level as long but rarer transcripts. Intron retention varied across genes so no reliable estimates of effective gene lengths were available for expression normalization.
Nuclei were included for clustering analysis if they passed all of the following QC thresholds:
> 30% cDNA longer than 400 base pairs
> 500,000 reads aligned to exonic or intronic sequence
> 40% of total reads aligned
> 50% unique reads
TA nucleotide ratio > 0.7
Nuclei were grouped into transcriptomic cell types using an iterative clustering procedure similar to the analysis used (Tasic et al. 2018; Hodge, Bakken et al., 2019). Briefly, intronic and exonic read counts were summed, and log2-transformed expression (CPM + 1) was centered and scaled across nuclei. X- and Y-chromosomes were excluded to avoid nuclei clustering based on sex. Many mitochondrial genes had expression that was correlated with RNA-seq data quality, so nuclear and mitochondrial genes downloaded from Human MitoCarta2.0 were excluded. Differentially expressed genes were selected while accounting for gene dropouts, and principal components analysis (PCA) was used to reduce dimensionality. Nearest-neighbor distances between nuclei were calculated using up to 20 principal components, Jaccard similarity coefficients were computed, and Louvain community detection (>= 3000 nuclei) or Ward’s agglomerative hierarchical (< 3000 nuclei) clustering methods were used to cluster this graph with 15 nearest neighbors. Marker genes were defined for all cluster pairs using two criteria: 1) significant differential expression (> 2-fold; Benjamini-Hochberg false discovery rate < 0.01) using the R package limma and 2) binary expression (CPM > 1 in more than half of nuclei in one cluster and < 30% of this proportion in the second cluster). Pairs of clusters were merged if either cluster lacked at least one marker gene or if the sum of −log10(adjusted P-value) of differentially expressed genes was less than 40. Clustering was then applied iteratively to each sub-cluster until the occurrence of one of three stop criteria: 1) fewer than 8 nuclei (due to a minimum cluster size of 4), 2) no significantly variable genes, 3) no significant clusters.
To assess the robustness of clusters, the iterative clustering procedure described above was repeated 100 times for random subsets of 80% of nuclei. A co-clustering matrix was generated that represented the proportion of clustering iterations that each pair of nuclei were assigned to the same cluster. We defined consensus clusters by iteratively splitting the co-clustering matrix as described (Tasic et al. 2018; Hodge, Bakken et al., 2019). We used the co-clustering matrix as the similarity matrix and clustered using either Louvain (>= 4000 nuclei) or Ward’s algorithm (< 4000 nuclei). We defined Nk,l as the average probabilities of nuclei within cluster k to co-cluster with nuclei within cluster l. We merged clusters k and l if Nk,l > max(Nk,k, Nl,l) - 0.25 or if the sum of −log10(adjusted P-value) of differentially expressed genes between clusters k and l was less than 40. Finally, we refined cluster membership by reassigning each nucleus to the cluster to which it had maximal average co-clustering. We repeated this process until cluster membership converged.
After clustering, clusters were identified as outliers if more than half of nuclei co-expressed markers of inhibitory (GAD1, GAD2) and excitatory (SLC17A7) neurons or were NeuN+ but did not express the pan-neuronal marker SNAP25. Median values of QC metrics listed above were calculated for each cluster and used to compute the median and inter-quartile range (IQR) of all cluster medians. Clusters were also identified as outliers if the cluster median QC metrics deviated by more than three times the IQRs from the median of all clusters.
Clusters were identified as donor-specific if they included fewer nuclei sampled from donors than expected by chance. For each cluster, the expected proportion of nuclei from each donor was calculated based on the laminar composition of the cluster and laminar sampling of the donor. For example, if 30% of layer 3 nuclei were sampled from a donor, then a layer 3-enriched cluster should contain approximately 30% of nuclei from this donor. In contrast, if only layer 5 were sampled from a donor, then the expected sampling from this donor for a layer 1-enriched cluster was zero. If the difference between the observed and expected sampling was greater than 50% of the number of nuclei in the cluster, then the cluster was flagged as donor-specific and excluded.
To confirm exclusion, clusters automatically flagged as outliers or donor-specific were manually inspected for expression of broad cell class marker genes, mitochondrial genes related to quality, and known activity-dependent genes.
The clustering pipeline is implemented in R package and the clustering method is provided by run_consensus_clust function.
Data generation was supported by multiple awards, including Brain Initiative Cell Census Network (BICCN) award U01MH114812 from the National Institute of Mental Health and the National Institute of Neurological Disorders and Stroke, and by the Allen Institute for Brain Science.