Overview
This dataset, the Human Postmortem-derived Brain Sequencing Collection (PMDBS), is a harmonized repository of single-nucleus and PolyA RNA-seq data contributed by multiple ASAP CRN teams. The goal in curating this dataset is to create a uniformly aligned and processed reference in which the processing methods and choices are explicit and identical across all contributed data. Our aim was to make conservative (i.e. permissive) choices, usually the tools’ defaults, and to leave more nuanced considerations to scientists testing specific hypotheses. Below we outline the Quality Control (QC) steps performed. The complete workflow used for processing the platformed data can be found at the ASAP CRN GitHub repository: pmdbs-sc-rnaseq-wf.
Quality Control
The processing and integration of the PMDBS samples can be described as three stages: “pre-processing”, followed by “processing”, followed by “integration”. QC steps occur during “pre-processing” and in an initial filtering step at the beginning of “processing”. It is worth pointing out a common implicit (but rarely stated) QC assumption: that the raw data (FASTQ files) for a sample are included in the dataset contribution at all. This zeroth level of QC is simply the result of the contributing scientist’s decision to include the FASTQ files in a dataset.
Preprocessing
The next QC step is built into standard CellRanger (v7.1.0), which aligns the FASTQ files and identifies and removes empty droplets and dead cells. The workflow preserves the additional CellRanger QC reports, but they are not considered in subsequent steps. CellBender (v0.3.0), in turn, removes background RNA noise, improving cell-droplet separation. CellBender is run with default parameters. Likewise, the workflow preserves the CellBender reports, but they are not considered in subsequent steps.
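As a rough illustration of the two pre-processing calls described above, the sketch below builds the command lines for `cellranger count` and `cellbender remove-background` with default settings. It is not the workflow's actual invocation (see the pmdbs-sc-rnaseq-wf repository for that). The sample name, directory paths and helper function names are hypothetical, and it assumes CellBender is given CellRanger's raw (unfiltered) matrix as input.

```python
from pathlib import Path


def cellranger_count_cmd(sample_id, fastq_dir, transcriptome):
    """Build an argv list for `cellranger count` on one sample."""
    return [
        "cellranger", "count",
        f"--id={sample_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastq_dir}",
        f"--sample={sample_id}",
    ]


def cellbender_cmd(raw_h5, out_h5):
    """Build an argv list for `cellbender remove-background`, defaults only."""
    return [
        "cellbender", "remove-background",
        "--input", str(raw_h5),
        "--output", str(out_h5),
    ]


# Hypothetical sample name and paths, for illustration only.
sample = "sample_01"
cr_cmd = cellranger_count_cmd(sample, "fastqs/", "refdata/")

# CellRanger writes its unfiltered matrix under <id>/outs/.
raw_h5 = Path(sample) / "outs" / "raw_feature_bc_matrix.h5"
cb_cmd = cellbender_cmd(raw_h5, Path(sample) / "cellbender.h5")

print(" ".join(cr_cmd))
print(" ".join(cb_cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` on a machine where both tools are installed.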
Filtering
Finally, filtering happens at the beginning of “processing”. Standard `scrublet` (v0.2.3) is employed via `scanpy` to create doublet scores, and cells are then excluded based on doublet scores, mitochondrial gene percentages, total counts, and number of genes per cell. For this dataset the thresholds were defined to be:
Mitochondrial gene percentage < 10%
Doublet_scores < 0.2
Total counts between 100 and 100,000
Number of features (genes) per cell between 100 and 10,000
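The thresholds above can be sketched as a single per-cell predicate. This is a minimal illustration, not the workflow's code. The field names (`pct_counts_mt`, `doublet_score`, `total_counts`, `n_genes_by_counts`) follow common `scanpy` conventions and are an assumption about how the metrics are stored. Treating the "between" ranges as inclusive at both ends is also an assumption, and the toy cells are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCThresholds:
    """Filtering thresholds listed for this dataset."""
    max_pct_mito: float = 10.0       # mitochondrial gene percentage < 10%
    max_doublet_score: float = 0.2   # doublet score < 0.2
    min_total_counts: int = 100
    max_total_counts: int = 100_000
    min_genes: int = 100
    max_genes: int = 10_000


def passes_qc(cell, t=QCThresholds()):
    """Return True if a cell's metrics satisfy every threshold."""
    return (
        cell["pct_counts_mt"] < t.max_pct_mito
        and cell["doublet_score"] < t.max_doublet_score
        # Range bounds are assumed inclusive.
        and t.min_total_counts <= cell["total_counts"] <= t.max_total_counts
        and t.min_genes <= cell["n_genes_by_counts"] <= t.max_genes
    )


# Made-up cells: one clean, one per failure mode.
cells = [
    {"barcode": "cell_A", "pct_counts_mt": 2.5, "doublet_score": 0.05,
     "total_counts": 5_000, "n_genes_by_counts": 2_000},
    {"barcode": "cell_B", "pct_counts_mt": 15.0, "doublet_score": 0.05,
     "total_counts": 5_000, "n_genes_by_counts": 2_000},   # high mito
    {"barcode": "cell_C", "pct_counts_mt": 2.5, "doublet_score": 0.35,
     "total_counts": 5_000, "n_genes_by_counts": 2_000},   # likely doublet
    {"barcode": "cell_D", "pct_counts_mt": 2.5, "doublet_score": 0.05,
     "total_counts": 80, "n_genes_by_counts": 60},         # too few counts/genes
]

kept = [c["barcode"] for c in cells if passes_qc(c)]
print(kept)
```

On an AnnData object the same conditions would typically be combined into one boolean mask over the `obs` columns and applied in a single subsetting step.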
Dimension Reduction
Dimension reduction to aid visualization was done by computing a 2D UMAP. The UMAP was computed from the nearest-neighbor graph of the 30-dimensional latent space defined by an `scVI` model fit to the raw counts for the top 3k highly variable genes (HVG). The `scVI` model performs variational inference to estimate the expected gene expression underlying our raw counts. Details are in the integrate_scvi.py and clustering_umap.py scripts from the pmdbs-sc-rnaseq-wf GitHub repo.
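The sequence described above (HVG selection, `scVI` latent space, neighbor graph, UMAP) might look roughly like the following with `scanpy` and `scvi-tools`. This is a sketch under stated assumptions and may differ from integrate_scvi.py and clustering_umap.py. The function name, the `batch_key` column name, the `"seurat_v3"` HVG flavor and the `"X_scVI"` key are illustrative choices, and training uses library defaults.

```python
def embed_with_scvi(adata, batch_key="batch", n_top_genes=3000, n_latent=30):
    """Return a copy of `adata` with an scVI latent space and a 2D UMAP.

    Assumes `adata.X` holds raw counts and `adata.obs[batch_key]` exists
    (the column name is hypothetical).
    """
    # Imported lazily so this module loads without the heavy dependencies.
    import scanpy as sc
    import scvi

    adata = adata.copy()  # leave the caller's object untouched

    # Keep only the top highly variable genes; "seurat_v3" expects raw counts.
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=n_top_genes,
        flavor="seurat_v3",
        batch_key=batch_key,
        subset=True,
    )

    # Fit scVI on the raw counts, with batch as the only covariate.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)
    model = scvi.model.SCVI(adata, n_latent=n_latent)
    model.train()

    # Store the latent representation, then build the graph and UMAP on it.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.umap(adata)  # 2 components by default
    return adata
```

The resulting coordinates are stored in `adata.obsm["X_umap"]`.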
Caveats
A very naive model of gene expression variability was deliberately chosen as the means to equalize variance due to batch effects (both within a Team’s contribution of samples and between contributions). This model treats all observations as equivalent regardless of disease state or putative neuronal/non-neuronal identity. Future work that models these additional sources of variability could result in UMAP representations which better localize the cell_types inferred here. Other potential drivers of overlapping representations of cells in our UMAP include the limitation to only 3k HVG and the simplicity of the `scVI` model’s architecture.