Cell Ranger1.2, printed on 10/16/2024
The cellranger pipeline outputs two types of gene-barcode matrices.
Type | Description |
---|---|
Unfiltered gene-barcode matrices | Contain every barcode from the fixed list of known-good barcode sequences, including background and non-cellular barcodes. |
Filtered gene-barcode matrices | Contain only detected cellular barcodes. |
The cellranger pipeline generates one gene-barcode matrix per species. Each matrix is stored in Market Exchange Format (MEX). The matrix directory also contains TSV files listing the genes and barcode sequences that correspond to the row and column indices, respectively. For example, if cellranger is run with a human reference, the output may look like:
    $ cd /home/jdoe/runs/sample345/outs
    $ tree filtered_gene_bc_matrices
    filtered_gene_bc_matrices
    └── hg19
        ├── barcodes.tsv
        ├── genes.tsv
        └── matrix.mtx
    2 directories, 3 files
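To make the layout of matrix.mtx concrete, here is a minimal sketch of reading such a file with only the standard library. It assumes the usual Matrix Market coordinate layout: a `%%MatrixMarket` banner, optional `%` comment lines, a size line, then 1-based `row column value` triples. The function name `parse_mex` and the sample text are hypothetical; in practice a library reader such as `scipy.io.mmread` does this work.

```python
import io

def parse_mex(handle):
    """Parse a coordinate-format Matrix Market stream.

    Returns (n_rows, n_cols, entries), where entries maps
    zero-based (row, col) pairs to integer counts.
    """
    banner = handle.readline()
    if not banner.startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket banner")

    # Skip comment lines, then read the "rows cols nonzeros" size line.
    line = handle.readline()
    while line.startswith("%"):
        line = handle.readline()
    n_rows, n_cols, n_nonzero = (int(x) for x in line.split())

    entries = {}
    for _ in range(n_nonzero):
        row, col, value = handle.readline().split()
        # File indices are 1-based; convert to 0-based for Python.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    return n_rows, n_cols, entries


# Hypothetical 3-gene x 2-barcode matrix with three nonzero counts.
sample = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
n_genes, n_barcodes, counts = parse_mex(sample)
```

Only nonzero entries are stored, which is why the format suits count matrices where most gene-barcode pairs are zero.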
Genes correspond to row indices. For each gene, its gene ID and gene name are stored in the first and second columns of the genes.tsv file, respectively.
    $ head filtered_gene_bc_matrices/hg19/genes.tsv
    ENSG00000243485 MIR1302-10
    ENSG00000237613 FAM138A
    ENSG00000186092 OR4F5
    ENSG00000238009 RP11-34P13.7
    ENSG00000239945 RP11-34P13.8
    ENSG00000237683 AL627309.1
    ENSG00000239906 RP11-34P13.14
    ENSG00000241599 RP11-34P13.9
    ENSG00000228463 AP006222.2
    ENSG00000237094 RP4-669L17.10
Gene ID corresponds to gene_id in the annotation field of the reference GTF. Similarly, gene name corresponds to gene_name in the annotation field of the reference GTF. If no gene_name field is present in the reference GTF, gene name is equivalent to gene ID.
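The fallback rule above can be sketched as follows. This is an illustration only, not the pipeline's own code: it assumes the attribute column of a GTF line holds `key "value";` pairs, and the helper name `gene_id_and_name` and the sample strings are hypothetical.

```python
import re

# Matches key "value" pairs in a GTF attribute column.
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')

def gene_id_and_name(attributes):
    """Return (gene_id, gene_name) from a GTF attribute string.

    Falls back to the gene ID when no gene_name attribute is present.
    """
    fields = dict(_ATTRIBUTE.findall(attributes))
    gene_id = fields["gene_id"]
    return gene_id, fields.get("gene_name", gene_id)


# Hypothetical attribute strings for illustration.
with_name = 'gene_id "ENSG00000243485"; gene_name "MIR1302-10";'
without_name = 'gene_id "ENSG00000243485";'

named = gene_id_and_name(with_name)
unnamed = gene_id_and_name(without_name)
```

With the second string, both elements of the returned pair are the gene ID, which matches what would appear in the two columns of genes.tsv for a reference lacking gene_name.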
For multi-species experiments, gene IDs and names are prefixed with the genome name to avoid name collisions between genes of different species. For example, GAPDH becomes hg19_GAPDH and Gm15816 becomes mm10_Gm15816.
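When working with multi-species output, it can be handy to recover the genome and the original gene name from a prefixed label. The sketch below assumes you already know the list of genome names used in the run; the helper `split_genome_prefix` is hypothetical. It matches against known genomes rather than splitting on the first underscore, since gene names may themselves contain underscores.

```python
def split_genome_prefix(label, genomes):
    """Split 'hg19_GAPDH' into ('hg19', 'GAPDH').

    Returns (None, label) when no known genome prefix matches.
    """
    for genome in genomes:
        prefix = genome + "_"
        if label.startswith(prefix):
            return genome, label[len(prefix):]
    return None, label


genomes = ["hg19", "mm10"]
human = split_genome_prefix("hg19_GAPDH", genomes)
mouse = split_genome_prefix("mm10_Gm15816", genomes)
unprefixed = split_genome_prefix("GAPDH", genomes)
```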
Barcode sequences correspond to column indices.
    $ head filtered_gene_bc_matrices/hg19/barcodes.tsv
    AAACATACAAAACG-1
    AAACATACAAAAGC-1
    AAACATACAAACAG-1
    AAACATACAAACGA-1
    AAACATACAAAGCA-1
    AAACATACAAAGTG-1
    AAACATACAACAGA-1
    AAACATACAACCAC-1
    AAACATACAACCGT-1
    AAACATACAACCTG-1
Each barcode sequence includes a suffix with a dash separator followed by a number:
AGAATGGTCTGCAT-1
More details on the barcode sequence format are available in the barcoded BAM section.
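A barcode entry can be separated into its sequence and numeric suffix with a few lines of Python. This is a minimal sketch; the helper name `split_barcode` is hypothetical, and the meaning of the number is covered in the barcoded BAM section rather than here.

```python
def split_barcode(barcode):
    """Split 'AGAATGGTCTGCAT-1' into ('AGAATGGTCTGCAT', 1)."""
    sequence, sep, suffix = barcode.rpartition("-")
    if not sep or not suffix.isdigit():
        raise ValueError("expected '<sequence>-<number>': %r" % barcode)
    return sequence, int(suffix)


sequence, number = split_barcode("AGAATGGTCTGCAT-1")
```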
Both R and Python can read MEX files, and loading the matrix as a sparse matrix keeps memory use and manipulation efficient.
The cellrangerRkit library is needed to load a gene-barcode matrix into R. Assuming you have R installed and Rscript on your $PATH, this can be installed by running the following command:
$ cellranger install-rkit
For example, running the following code loads the filtered human (hg19) gene-barcode matrix into R:
    library(cellrangerRkit)
    genome <- "hg19"
    gene_bc_matrix <- load_cellranger_matrix("/opt/sample345", genome=genome)
To load a gene-barcode matrix from another species, you will need to edit the genome variable above. For example, to load the filtered mouse (mm10) gene-barcode matrix, you would set genome <- "mm10" and rerun the script above.
The csv, os and scipy.io libraries are recommended for loading a gene-barcode matrix into Python.
    import csv
    import os
    import scipy.io

    genome = "hg19"
    matrices_dir = "/opt/sample345/outs/filtered_gene_bc_matrices"
    human_matrix_dir = os.path.join(matrices_dir, genome)

    # Read the sparse count matrix (rows are genes, columns are barcodes)
    mat = scipy.io.mmread(os.path.join(human_matrix_dir, "matrix.mtx"))

    # First column of genes.tsv is the gene ID, second is the gene name
    genes_path = os.path.join(human_matrix_dir, "genes.tsv")
    gene_ids = [row[0] for row in csv.reader(open(genes_path), delimiter="\t")]
    gene_names = [row[1] for row in csv.reader(open(genes_path), delimiter="\t")]

    # One barcode per line, in matrix column order
    barcodes_path = os.path.join(human_matrix_dir, "barcodes.tsv")
    barcodes = [row[0] for row in csv.reader(open(barcodes_path), delimiter="\t")]
As with R, to load a gene-barcode matrix from another species, you will need to edit the genome variable above.
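Once the matrix is loaded, typical manipulations stay in sparse form. The sketch below uses a small hand-built matrix as a hypothetical stand-in for `mat`, `gene_names` and `barcodes` from the loading code above, and assumes NumPy is available alongside SciPy. It converts the matrix to compressed column form, totals the counts per barcode, and pulls out the counts for one gene.

```python
import numpy as np
import scipy.sparse

# Hypothetical stand-ins for the loaded objects:
# 3 genes (rows) x 2 barcodes (columns).
gene_names = ["MIR1302-10", "FAM138A", "OR4F5"]
barcodes = ["AAACATACAAAACG-1", "AAACATACAAAAGC-1"]
mat = scipy.sparse.coo_matrix(
    ([5, 2, 4], ([0, 2, 1], [0, 0, 1])), shape=(3, 2)
)

# Coordinate form suits file I/O; compressed column form makes
# per-barcode (column) operations efficient.
csc = mat.tocsc()

# Total counts per barcode: sum down each column.
counts_per_barcode = np.asarray(csc.sum(axis=0)).ravel()
totals = dict(zip(barcodes, counts_per_barcode.tolist()))

# Counts for one gene across all barcodes: select its row by index.
row = gene_names.index("OR4F5")
or4f5_counts = csc.tocsr()[row].toarray().ravel().tolist()
```

Row selection converts to compressed row form first, since that layout is the efficient one for per-gene access.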