image representing the current explore topic

MapMyCells: Input file requirements, limits, and creation

Input h5ad file requirements

We currently accept h5ad files of up to 4GB for your expression data. These files are produced by AnnData, a widely used tool for creating, manipulating, and saving large data matrices, such as for expression data.

This mapping tool requires simply the cell by gene information of your dataset. Other data in the h5ad file will be ignored for mapping. Rows are expected to be cells and columns are expected to be genes.

For optimal results & performance, your data should be raw count data in a sparse CSR or dense matrix. It should use mouse-based or human-based Ensembl gene IDs to match the respective taxonomy.

Example input files

Cross species mappings require appropriate orthologs in your input data. Translate NCBI gene IDs to Ensembl gene IDs across species with the R package in this GitHub repository.

Creating h5ad input files in R & Python

Using Python

Read your data into Python however you want.  Some examples include:

1. If your data is stored as a csv file with sample names as columns and gene names in the first row:

As a specific example, we'll download SmartSeq data from mouse VISp from here (https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq) and convert to h5ad.

Prior to running any code, download the data from this file (https://celltypes.brain-map.org/api/v2/well_known_file_download/694413985)[292MB] to your working directory, and unzip it.

import pandas as pd

Read in exon and intron counts. Combine into a single count matrix.

exons = pd.read_csv("mouse_VISp_2018-06-14_exon-matrix.csv")
exons.rename({exons.columns[0]: 'gene_entrez_id'}, axis=1, inplace=True)

introns = pd.read_csv("mouse_VISp_2018-06-14_intron-matrix.csv")
introns.rename({introns.columns[0]: 'gene_entrez_id'}, axis=1, inplace=True)

Ignoring the first column of each DataFrame, which is just the sample ID

count_matrix = exons[exons.columns[1:]].to_numpy()
count_matrix += introns[introns.columns[1:]].to_numpy()

Store sample-based metadata in a dataframe called obs

obs = pd.DataFrame(
        [{'sample_id': s} for s in exons.columns[1:]]).set_index('sample_id')

Store gene-based metadata in a dataframe called var

var = exons[['gene_entrez_id']]

add gene symbols and set those as the index

gene_symbol_df = pd.read_csv('mouse_VISp_2018-06-14_genes-rows.csv')
var = var.join(gene_symbol_df.set_index('gene_entrez_id'), on='gene_entrez_id')
var = var.set_index('gene_symbol')

Transpose counts so samples are rows and genes are columns.

Since this file is large we'll just read in the first 10 cells for this example.

Be sure to downsample obs as well. anndata expects obs to have as many rows as the count matrix.

count_matrix = count_matrix[:, :10].transpose()
obs = obs[:10]

 

2. If your data is stored as hdf5 format with genes as columns and sample names as rows:

Read in the data with the h5py library.

As a specific example, we'll download 10x data from mouse cortex and hippocampus from here (https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-10x). 

Prior to running any code, download the data from this file (https://idk-etl-prod-download-bucket.s3.amazonaws.com/aibs_mouse_ctx-hpf_10x/expression_matrix.hdf5)[5.3GB] to your working directory.

import h5py
import pandas as pd

src = h5py.File('expression_matrix.hdf5', 'r')

If we were reading all of the samples, we would use

count_matrix = src['data/counts'][()].transpose()

Only get first 10 samples for speed. 

Transpose the counts matrix because anndata expects each row to be a sample and each column to be a gene.

count_matrix = src['data/counts'][:, :10].transpose()
gene_symbols = src['data/gene'][()]
sample_names = src['data/samples'][()][:10]

src.close()

Store sample-based metadata in a dataframe called obs

obs = pd.DataFrame(
    {'sample_name': s} for s in sample_names).set_index('sample_name')

Store gene-based metadata in a dataframe called var

var = pd.DataFrame(
    {'gene_symbol': g} for g in gene_symbols).set_index('gene_symbol')

At this point, you should have a matrix (dense or sparse) with samples as rows and columns as genes.

 

Save the data to an h5ad file using the "anndata" python library
import anndata
import scipy.sparse

Convert count matrix to a CSR (row-based access) sparse matrix. (this is optional; just do not save as a CSC sparse matrix)

count_matrix = scipy.sparse.csr_matrix(count_matrix)

countAD = anndata.AnnData(
     X=count_matrix,
     obs=obs,
     var=var)

countAD.write_h5ad("counts.h5ad")

Using R

Read your data into R. This can be done in different ways. Some examples include:

1.) If your data is stored as a csv file with sample names as columns and gene names in the first row:

Read in the data with "fread" in the "data.table" R library".

As a specific example, we'll download SmartSeq data from mouse VISp from here (https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq) and convert to h5ad

Prior to running any code, download the data from this file (https://celltypes.brain-map.org/api/v2/well_known_file_download/694413985)[292MB] to your working directory, and unzip it.

library(data.table)  # Load library
exons    <- as.matrix(fread("mouse_VISp_2018-06-14_exon-matrix.csv"),rownames=1)
introns  <- as.matrix(fread("mouse_VISp_2018-06-14_intron-matrix.csv"),rownames=1)
t_counts <- exons + introns

Count matrix has Entrez ID as row names, and this file below has gene symbols, if needed.

gene_info <- read.csv("mouse_VISp_2018-06-14_genes-rows.csv",row.names=1) # This file has gene names

Transpose counts so samples are rows and genes are columns.

Since this file is large we'll just read in the first 10 cells for this example.

counts  <- t(t_counts[,1:10])

 

2.) If your data is stored as hdf5 format with genes as columns and sample names as rows

Read in the data with "h5read" in the "rhdf5" R library.

As a specific example, we'll download 10x data from mouse cortex and hippocampus from here: https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-10x.

Prior to running any code, download the data from this file (https://idk-etl-prod-download-bucket.s3.amazonaws.com/aibs_mouse_ctx-hpf_10x/expression_matrix.hdf5)[5.3GB] to your working directory.

library(rhdf5) # Load library
h5ls("expression_matrix.hdf5") # This will let you find the location of the counts

Since this file is large we'll just read in the first 10 cells for this example.

counts  <- h5read("expression_matrix.hdf5", "/data/counts", index=list(1:10,NULL))
genes   <- h5read("expression_matrix.hdf5","/data/gene")
samples <- h5read("expression_matrix.hdf5","/data/samples")[1:10]
colnames(counts) <- as.character(genes)   # Add row names
rownames(counts) <- as.character(samples) # Add column names

At this point, you should have a matrix (dense or sparse) with samples as rows and columns as genes.

 

Save the data to an h5ad file using the "anndata" r library:
library(anndata) # Load library
genes   <- colnames(counts)
samples <- rownames(counts)
sparse_counts <- as(counts, "dgCMatrix")  # Convert to sparse matrix, if not already
countAD <- AnnData(X   = sparse_counts,   # Create the anndata object
                   var = data.frame(genes=genes,row.names=genes),
                   obs = data.frame(samples=samples,row.names=samples))
write_h5ad(countAD, "counts.h5ad") # Write it out as h5ad

Reducing size of h5ad files in R & Python

It's possible to bring a large h5ad file into compliance with the 4GB file limit by reducing it to the mendatory adata.Xadata.var_names, & adata.obs_names and removing all other data and metadata.

Using Python

The below steps minimized a 6.3GB h5ad file containing ~41,000 cells by 500 genes to a 120MB h5ad file.

import anndata as ad

Load your existing h5ad file that is >500MB.

my_too_large_adata = ad.read_h5ad('path/to/file/my_data.h5ad')

Copy over just the cell-by-gene table, .X, into a new AnnData object.

minimal_adata = ad.AnnData(my_too_large_adata.X)

Copy over the gene identifiers (required for mapping).

# (a) if gene identifiers are stored in adata.var.index:
minimal_adata.var_names = my_too_large_adata.var_names

# (b) if gene identifiers are stored in adata.var['your_column']:
minimal_adata.var_names = my_too_large_adata.var['your_column']

Copy over cell identifiers (helpful for keeping track during later analyses of mapping results).

smaller_adata.obs_names = my_too_large_adata.obs_names

Write minimal adata (hopefully now <500MB!) to h5ad file.

minimal_adata.write('path/to/file/new_adata_for_MapMyCells.h5ad', compression='gzip')

Using R

The below steps decrease the size of an existing h5ad file down to the minimal possible file that can be mapped. They leave only adata$X, adata.var_names, & adata.obs_names and remove all additional data and metadata. Example: 1.5GB h5ad file containing ~5,000 cells by ~36,000 genes minimized to 170MB h5ad file.

library(anndata)

Load your existing h5ad file that is >500MB.

my_too_large_adata = read_h5ad('path/to/file/my_data.h5ad')

Copy over just the cell-by-gene table, .X, into a new AnnData object.

minimal_adata = my_too_large_adata$X

Copy over the gene identifiers (required for mapping).

# (a) if gene identifiers are stored as colnames of my_too_large_adata$X, you can skip this step
# (b) if gene identifiers are stored in adata$var['your_column']:
colnames(minimal_adata) = my_too_large_adata$var[['your_column']]

Copy over cell identifiers (helpful for keeping track during later analyses of mapping results) [OPTIONAL].

# (a) if cell identifiers are stored as rownames of my_too_large_adata$X, you can skip this step
# (b) if cell identifiers are stored in adata$obs['your_column']:
rownames(minimal_adata) = my_too_large_adata$obs[['your_column']]

Write minimal adata (hopefully now <500MB!) to h5ad file.

ad <- AnnData(
  X = minimal_adata,
  obs = data.frame(group = rownames(minimal_adata), row.names = rownames(minimal_adata)),
  var = data.frame(type = colnames(minimal_adata), row.names = colnames(minimal_adata))
)
write_h5ad(ad,'path/to/file/new_adata_for_MapMyCells.h5ad', 
                    compression='gzip')