Cell Ranger 2.0, printed on 09/11/2024
The cellranger pipeline outputs two types of gene-barcode matrices.
Type | Description |
---|---|
Unfiltered gene-barcode matrices | Contains every barcode from a fixed list of known-good barcode sequences. This includes background and non-cellular barcodes. |
Filtered gene-barcode matrices | Contains only detected cellular barcodes. |
The cellranger pipeline generates one gene-barcode matrix per species. Each matrix is stored in Market Exchange Format (MEX). Each matrix directory also contains TSV files listing the genes and barcode sequences that correspond to the row and column indices, respectively. For example, if cellranger is run with a human reference, the output matrices may look like:
$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_gene_bc_matrices
filtered_gene_bc_matrices
└── hg19
    ├── barcodes.tsv
    ├── genes.tsv
    └── matrix.mtx
2 directories, 3 files
Genes correspond to row indices. For each gene, its gene ID and gene name are stored in the first and second column of the genes.tsv file, respectively.
$ head filtered_gene_bc_matrices/hg19/genes.tsv
ENSG00000243485    MIR1302-10
ENSG00000237613    FAM138A
ENSG00000186092    OR4F5
ENSG00000238009    RP11-34P13.7
ENSG00000239945    RP11-34P13.8
ENSG00000237683    AL627309.1
ENSG00000239906    RP11-34P13.14
ENSG00000241599    RP11-34P13.9
ENSG00000228463    AP006222.2
ENSG00000237094    RP4-669L17.10
Gene ID corresponds to gene_id in the annotation field of the reference GTF. Similarly, gene name corresponds to gene_name in the annotation field of the reference GTF. If no gene_name field is present in the reference GTF, gene name is equivalent to gene ID.
For multi-species experiments, gene IDs and names are prefixed with the genome name to avoid name collisions between genes of different species. For example, GAPDH becomes hg19_GAPDH and Gm15816 becomes mm10_Gm15816.
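When working with multi-species output, it can be handy to recover the genome and the original gene name from a prefixed name. The following is a minimal sketch of one way to do that; `split_genome_prefix` is a hypothetical helper written for illustration and is not part of Cell Ranger. It assumes you already know the genome names used in your reference.

```python
def split_genome_prefix(name, genomes):
    """Split a prefixed name such as 'hg19_GAPDH' into ('hg19', 'GAPDH').

    Returns (None, name) when no known genome prefix matches, which is
    what you would expect for single-species output.
    """
    for genome in genomes:
        prefix = genome + "_"
        if name.startswith(prefix):
            return genome, name[len(prefix):]
    return None, name


# Genome names taken from the examples in this document.
genomes = ["hg19", "mm10"]

human = split_genome_prefix("hg19_GAPDH", genomes)
mouse = split_genome_prefix("mm10_Gm15816", genomes)
plain = split_genome_prefix("GAPDH", genomes)
```

Matching against an explicit list of genome names, rather than splitting on the first underscore, avoids misreading gene names that themselves contain underscores.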
Barcode sequences correspond to column indices.
$ head filtered_gene_bc_matrices/hg19/barcodes.tsv
AAACATACAAAACG-1
AAACATACAAAAGC-1
AAACATACAAACAG-1
AAACATACAAACGA-1
AAACATACAAAGCA-1
AAACATACAAAGTG-1
AAACATACAACAGA-1
AAACATACAACCAC-1
AAACATACAACCGT-1
AAACATACAACCTG-1
Each barcode sequence includes a suffix with a dash separator followed by a number:
AGAATGGTCTGCAT-1
More details on the barcode sequence format are available in the barcoded BAM section.
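If you need the nucleotide sequence and the numeric suffix separately, a small helper along these lines may be enough. This is only a sketch: `split_barcode` is a hypothetical name, and the code assumes nothing about the barcode beyond the `<sequence>-<number>` shape described above.

```python
def split_barcode(barcode):
    """Split 'AGAATGGTCTGCAT-1' into ('AGAATGGTCTGCAT', 1)."""
    # rpartition splits on the last dash, so the suffix is always the tail.
    sequence, dash, suffix = barcode.rpartition("-")
    if not dash or not suffix.isdigit():
        raise ValueError("expected '<sequence>-<number>', got %r" % barcode)
    return sequence, int(suffix)


sequence, number = split_barcode("AGAATGGTCTGCAT-1")
```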
R and Python support MEX format, and sparse matrices can be used for more efficient manipulation.
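To make the sparse representation concrete, here is a small pure-Python sketch that parses a tiny, made-up matrix written in the standard Matrix Market coordinate layout (a header line, a `rows cols nonzeros` line, then one 1-based `row col value` triplet per nonzero entry). It assumes matrix.mtx follows that standard layout; the data and the function names are invented for illustration, and in practice you would use a library reader rather than this code.

```python
# A made-up 3-gene x 4-barcode matrix with only three nonzero entries.
MEX_TEXT = """%%MatrixMarket matrix coordinate integer general
3 4 3
1 1 5
1 3 2
3 4 7
"""


def read_coordinate_text(text):
    """Parse coordinate-format text into (shape, {(row, col): value})."""
    # Skip blank lines and '%' header/comment lines.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_nonzero = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        r, c, v = ln.split()
        # The file uses 1-based indices; convert to 0-based for Python.
        entries[(int(r) - 1, int(c) - 1)] = int(v)
    if len(entries) != n_nonzero:
        raise ValueError("size line and entry count disagree")
    return (n_rows, n_cols), entries


def to_dense(shape, entries):
    """Expand the sparse entries into a full list-of-lists matrix."""
    n_rows, n_cols = shape
    return [[entries.get((r, c), 0) for c in range(n_cols)]
            for r in range(n_rows)]


shape, entries = read_coordinate_text(MEX_TEXT)
dense = to_dense(shape, entries)
```

Here the sparse form holds 3 values while the dense form holds all 12, most of them zeros. Real gene-barcode matrices are far larger and mostly zero, which is why the sparse form is preferred.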
The cellrangerRkit library is needed to load a gene-barcode matrix into R. Assuming you have R installed and Rscript on your $PATH, this can be installed by running the following command:
$ cellranger install-rkit
For example, running the following code loads the filtered human (hg19) gene-barcode matrix into R:
library(cellrangerRkit)
genome <- "hg19"
gene_bc_matrix <- load_cellranger_matrix("/opt/sample345", genome=genome)
To load a gene-barcode matrix from another species, you will need to edit the genome variable above. For example, to load the filtered mouse (mm10) gene-barcode matrix, you would set genome <- "mm10" and rerun the script above.
The csv, os and scipy.io libraries are recommended for loading a gene-barcode matrix into Python.
import csv
import os
import scipy.io
genome = "hg19"
matrices_dir = "/opt/sample345/outs/filtered_gene_bc_matrices"
human_matrix_dir = os.path.join(matrices_dir, genome)
# Sparse matrix: rows are genes, columns are barcodes
mat = scipy.io.mmread(os.path.join(human_matrix_dir, "matrix.mtx"))
# genes.tsv: first column is the gene ID, second column is the gene name
genes_path = os.path.join(human_matrix_dir, "genes.tsv")
with open(genes_path) as genes_file:
    gene_rows = list(csv.reader(genes_file, delimiter="\t"))
gene_ids = [row[0] for row in gene_rows]
gene_names = [row[1] for row in gene_rows]
# barcodes.tsv: one barcode sequence per line
barcodes_path = os.path.join(human_matrix_dir, "barcodes.tsv")
with open(barcodes_path) as barcodes_file:
    barcodes = [row[0] for row in csv.reader(barcodes_file, delimiter="\t")]
As with R, to load a gene-barcode matrix from another species, you will need to edit the genome variable above.
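Once the matrix, gene names, and barcodes are loaded, a common next step is to look up counts by gene or to total counts per barcode. The sketch below shows one possible approach using a toy matrix built in memory, so it runs without any Cell Ranger output. The values are made up, `counts_for_gene` is a hypothetical helper, and the toy `mat` merely stands in for the sparse matrix that `scipy.io.mmread` returns.

```python
import numpy as np
import scipy.sparse

# Toy stand-ins for the objects loaded above (values are made up).
gene_names = ["GAPDH", "ACTB", "CD3E"]
barcodes = ["AAACATACAAAACG-1", "AAACATACAAAAGC-1"]

# 3 genes x 2 barcodes in coordinate (COO) form.
rows = [0, 0, 2]
cols = [0, 1, 1]
vals = [5, 2, 7]
mat = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(3, 2))

# CSR supports efficient row (gene) slicing; COO does not support slicing.
csr = mat.tocsr()


def counts_for_gene(name):
    """Return {barcode: count} for one gene, densifying only that row."""
    row = csr[gene_names.index(name)].toarray().ravel()
    return dict(zip(barcodes, row.tolist()))


# Column sums give the total count for each barcode.
totals = np.asarray(csr.sum(axis=0)).ravel()
total_per_barcode = dict(zip(barcodes, totals.tolist()))
gapdh = counts_for_gene("GAPDH")
```

Only the single requested row is converted to a dense array, so memory use stays small even when the full matrix has tens of thousands of genes.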
Cell Ranger represents the gene-barcode matrix using sparse formats (only the nonzero entries are stored) in order to cut down on file size. All of our programs, and many other programs for gene expression analysis, support sparse formats.
However, certain programs (e.g. Excel) only support dense formats, where every row-column entry is explicitly stored even if it is zero. You can convert a gene-barcode matrix to dense CSV format using the cellranger mat2csv command. This command takes two arguments: an input matrix generated by Cell Ranger (either an H5 file or a MEX directory) and an output path for the dense CSV. For example, to convert a matrix from a pipestance named sample123 in the current directory, either of the following commands would work:
# convert from MEX
$ cellranger mat2csv sample123/outs/filtered_gene_bc_matrices sample123.csv
# or, convert from H5
$ cellranger mat2csv sample123/outs/filtered_gene_bc_matrices_h5.h5 sample123.csv
You can then load sample123.csv into Excel.
WARNING: Dense files can be very large and may cause Excel to crash. The mat2csv command itself may also fail if your computer doesn't have enough memory. For example, a gene-barcode matrix from a human reference (~33k genes) with ~3k barcodes uses at least 200MB of disk space. Our 1.3 million mouse neuron dataset, if converted to this format, would use more than 60GB of disk space.
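Before converting, you can get a rough lower bound on the dense CSV size from the matrix dimensions alone. The sketch below assumes each entry costs at least 2 bytes (a single digit plus a delimiter or newline) and ignores header rows and multi-digit counts, so real files will be larger. `dense_csv_lower_bound_bytes` is a hypothetical helper, not a Cell Ranger command.

```python
def dense_csv_lower_bound_bytes(n_genes, n_barcodes):
    """Lower bound on dense CSV size in bytes.

    Assumption: every entry takes at least 2 bytes, one digit plus
    one delimiter or newline. Headers and larger counts add more.
    """
    return n_genes * n_barcodes * 2


# The human example from the warning above: ~33k genes, ~3k barcodes.
human_bytes = dense_csv_lower_bound_bytes(33000, 3000)
human_mb = human_bytes / 1e6  # 198.0, i.e. roughly 200 MB
```

The estimate for the human example lands at about 198 MB, which lines up with the "at least 200MB" figure quoted in the warning.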