Cell Ranger 2.0, printed on 03/25/2023
In addition to MEX format, we also provide matrices in the Hierarchical Data Format (abbreviated HDF5 or H5). H5 is a binary format that can compress and access data much more efficiently than text formats such as MEX, which is especially useful when dealing with large datasets.
For more information on the format, see the Introduction to HDF5.
H5 files are supported in both Python and R.
The top level of the file contains a list of HDF5 groups, one per genome. For example, if your data contains both hg19 and mm10, there will be two top-level groups. The file hierarchy would look something like this:
    root
    ├── hg19
    │   ├── barcodes
    │   ├── data
    │   ...
    │   └── shape
    └── mm10
        ├── barcodes
        ├── data
        ...
        └── shape
Most of the time, your analysis will only contain a single genome.
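To check which genomes a particular file contains, you can list its top-level groups. The following is a minimal sketch that assumes PyTables is installed; the helper name `list_genomes` is illustrative and not part of Cell Ranger.

```python
import tables


def list_genomes(filename):
    """Return the sorted names of the top-level groups (one per genome)."""
    with tables.open_file(filename, 'r') as f:
        # Each group directly under the root corresponds to one genome.
        return sorted(node._v_name
                      for node in f.iter_nodes(f.root, classname='Group'))
```

For a file laid out like the hierarchy above, calling `list_genomes` on its path would be expected to return the two genome names, hg19 and mm10.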
Within each genome group, the matrix is stored in Compressed Sparse Column (CSC) format. For more details on the format, see this SciPy introduction. CSC represents the matrix in column-major order, such that each barcode is represented by a contiguous chunk of data values.
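To make the layout concrete, the sketch below decodes a toy CSC matrix in plain Python. The matrix values and the helper name `dense_column` are invented for illustration; only the roles of `data`, `indices`, `indptr` and `shape` follow the CSC convention.

```python
# Toy 3-gene x 4-barcode matrix (values invented for illustration):
#
#            bc0  bc1  bc2  bc3
#   gene0     1    0    0    2
#   gene1     0    0    3    0
#   gene2     4    0    0    5
data = [1, 4, 3, 2, 5]      # nonzero values, stored column by column
indices = [0, 2, 1, 0, 2]   # row (gene) index of each value in data
indptr = [0, 2, 2, 3, 5]    # start offset of each column, plus a final end offset
shape = (3, 4)              # (n_rows, n_columns)


def dense_column(col):
    """Expand one barcode's column into a dense list of per-gene counts."""
    column = [0] * shape[0]
    # The nonzero entries of column `col` occupy data[indptr[col]:indptr[col + 1]].
    for k in range(indptr[col], indptr[col + 1]):
        column[indices[k]] = data[k]
    return column
```

Because each column's values sit next to each other in `data`, reading all counts for one barcode touches a single contiguous slice, while an empty barcode (bc1 here) simply has equal start and end offsets.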
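| Column | Type | Description |
|---|---|---|
| barcodes | string | Barcode sequences and their corresponding gem groups |
| data | uint32 | Nonzero UMI counts in column-major order |
| gene_names | string | Gene symbols |
| genes | string | Ensembl gene IDs |
| indices | uint32 | Row index of corresponding element in data |
| indptr | uint32 | Index into data / indices of the start of each column |
| shape | uint64 | Tuple of (n_rows, n_columns) |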
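The datasets in a genome group are tied together by a few length relationships that follow from the CSC layout. The sketch below checks them on arrays that have already been read into memory. The function name `check_genome_group` is hypothetical, and the dataset names (`barcodes`, `data`, `genes`, `gene_names`, `indices`, `indptr`, `shape`) are the ones used by the Python loading code later in this page.

```python
def check_genome_group(barcodes, data, genes, gene_names, indices, indptr, shape):
    """Return a list of problems found; an empty list means the fields agree."""
    n_rows, n_cols = shape
    problems = []
    # Rows are genes, columns are barcodes.
    if len(genes) != n_rows or len(gene_names) != n_rows:
        problems.append('genes/gene_names length does not match n_rows')
    if len(barcodes) != n_cols:
        problems.append('barcodes length does not match n_columns')
    # Every stored value needs exactly one row index.
    if len(data) != len(indices):
        problems.append('data and indices differ in length')
    # indptr holds one start offset per column plus a final end offset.
    if len(indptr) != n_cols + 1:
        problems.append('indptr should have n_columns + 1 entries')
    elif indptr[0] != 0 or indptr[-1] != len(data):
        problems.append('indptr should start at 0 and end at len(data)')
    if any(i < 0 or i >= n_rows for i in indices):
        problems.append('indices contains an out-of-range row index')
    return problems
```

A check like this can be a quick sanity test after reading a file with a generic HDF5 library, before handing the arrays to a sparse-matrix constructor.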
The latest version of cellrangerRkit is required to load a gene-barcode matrix from H5. For example, running the following code loads a filtered human (hg19) gene-barcode matrix into R:
    library(cellrangerRkit)
    genome <- "hg19"
    pipestance_path <- "/opt/sample345"
    gene_bc_matrix <- load_cellranger_matrix_h5(pipestance_path, genome=genome)
There are two ways to load the H5 matrix into Python:
The first method uses the cellranger.matrix module that ships with Cell Ranger. It requires that you add cellranger/lib/python to your $PYTHONPATH.
For example, if you installed Cell Ranger into /opt/cellranger-2.0.2, you would run:
$ export PYTHONPATH=/opt/cellranger-2.0.2/lib/python:$PYTHONPATH
Then in Python, call:
    import cellranger.matrix as cr_matrix

    filtered_matrix_h5 = "/opt/sample345/outs/filtered_gene_bc_matrices_h5.h5"
    genome = "hg19"
    gene_bc_matrices = cr_matrix.GeneBCMatrices.load_h5(filtered_matrix_h5)
    gene_bc_matrix = gene_bc_matrices.get_matrix(genome)
The second method is a bit more involved. It does not depend on the Cell Ranger Python library, but it requires the SciPy and PyTables libraries.
    import collections
    import scipy.sparse as sp_sparse
    import tables

    GeneBCMatrix = collections.namedtuple('GeneBCMatrix', ['gene_ids', 'gene_names', 'barcodes', 'matrix'])

    def get_matrix_from_h5(filename, genome):
        with tables.open_file(filename, 'r') as f:
            try:
                # Each genome is stored as a top-level group.
                group = f.get_node(f.root, genome)
            except tables.NoSuchNodeError:
                print("That genome does not exist in this file.")
                return None
            # Read the datasets stored in the genome group.
            gene_ids = getattr(group, 'genes').read()
            gene_names = getattr(group, 'gene_names').read()
            barcodes = getattr(group, 'barcodes').read()
            data = getattr(group, 'data').read()
            indices = getattr(group, 'indices').read()
            indptr = getattr(group, 'indptr').read()
            shape = getattr(group, 'shape').read()
            # Rebuild the sparse matrix from its CSC components.
            matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)
            return GeneBCMatrix(gene_ids, gene_names, barcodes, matrix)

    filtered_matrix_h5 = "/opt/sample345/outs/filtered_gene_bc_matrices_h5.h5"
    genome = "hg19"
    gene_bc_matrix = get_matrix_from_h5(filtered_matrix_h5, genome)
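Once the matrix is loaded, common summaries are one-liners on the SciPy sparse matrix. The sketch below assumes a `GeneBCMatrix` namedtuple with the same fields as the one returned by `get_matrix_from_h5`; the helper names `umi_counts_per_barcode` and `counts_for_gene` are illustrative and not part of Cell Ranger. Note that under Python 3, PyTables may return gene names and barcodes as byte strings, in which case the gene name passed in would need to be bytes as well.

```python
import collections

import numpy as np
import scipy.sparse as sp_sparse

# Same field layout as the namedtuple used when loading the matrix.
GeneBCMatrix = collections.namedtuple(
    'GeneBCMatrix', ['gene_ids', 'gene_names', 'barcodes', 'matrix'])


def umi_counts_per_barcode(gene_bc_matrix):
    """Return a 1-D array with the total UMI count of each barcode (column)."""
    # Summing over axis 0 collapses the gene dimension; the 1 x n_barcodes
    # result is flattened into a plain array.
    return np.asarray(gene_bc_matrix.matrix.sum(axis=0)).ravel()


def counts_for_gene(gene_bc_matrix, gene_name):
    """Return per-barcode counts for the first gene whose symbol matches."""
    row = list(gene_bc_matrix.gene_names).index(gene_name)
    return gene_bc_matrix.matrix.getrow(row).toarray().ravel()
```

Both helpers return arrays ordered like the `barcodes` field, so the results can be paired with barcode sequences using `zip`.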