Cell Ranger ATAC 2.1, printed on 11/23/2024
In addition to the MEX format, Cell Ranger ATAC also provides matrices in the Hierarchical Data Format (abbreviated HDF5 or H5). H5 is a binary format that can compress and access data much more efficiently than text formats such as MEX, which is especially useful when dealing with large datasets.
For more information on the format, see the Introduction to HDF5.
H5 files are supported in Python, and we recommend loading them in Python using one of the two methods described further below.
Cell Ranger ATAC produces two flavors of feature-barcode matrices: the peak-barcode matrix and the transcription factor-barcode matrix. Consult Specifying Input FASTQ Files for 10x Pipelines.
The top level of the file contains the matrix HDF5 group, with the datasets describing the matrix listed under it. The file hierarchy looks like this:
    matrix
    ├── barcodes
    ├── data
    ├── features
    │   ├── _all_tag_keys
    │   ├── derivation
    │   ├── feature_type
    │   ├── genome
    │   ├── id
    │   └── name
    ├── indices
    ├── indptr
    └── shape
Within the matrix group, the matrix is stored in Compressed Sparse Column (CSC) format. For more details on the format, see this SciPy introduction. CSC represents the matrix in column-major order, so the values for each barcode (one column of the matrix) occupy a contiguous chunk of data.
Dataset | Type | Description |
---|---|---|
barcodes | string | Barcode sequences and their corresponding gem groups (e.g. AAACGGGCAGCTCGAC-1) |
data | uint32 | Nonzero UMI counts in column-major order |
features/_all_tag_keys | string | Feature attributes other than id, name, and feature_type. For 2.1.0, this is simply genome and derivation. |
features/derivation | string | Mechanism by which the feature was derived from primary feature types |
features/feature_type | string | Peaks or Motifs |
features/genome | string | Genome associated with each feature (e.g. hg19) |
features/id | string | Peak or motif name built into the reference (e.g. chr1:1000-2000, SPI1_HUMAN.MA0080.4) |
features/name | string | Peak or common motif name (e.g. chr1:1000-2000, SPI1_HUMAN.MA0080.4) |
indices | uint32 | Row index of corresponding element in data |
indptr | uint32 | Index into data / indices of the start of each column |
shape | uint64 | Tuple of (n_rows, n_columns) |
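To make the CSC layout concrete, here is a minimal sketch in plain Python of how data, indices, and indptr together describe one column (one barcode). The toy values and the helper name column_entries are made up for illustration. The sketch also assumes the usual SciPy convention in which indptr has n_columns + 1 entries, so that the last entry marks the end of the final column.

```python
def column_entries(data, indices, indptr, col):
    """Return (row_index, value) pairs for the nonzero entries of one column.

    Hypothetical helper: it only illustrates how the three CSC arrays fit
    together and is not part of Cell Ranger ATAC.
    """
    # indptr[col] and indptr[col + 1] delimit this column's slice
    # in both `indices` and `data`.
    start, end = indptr[col], indptr[col + 1]
    return list(zip(indices[start:end], data[start:end]))


# Toy 3-feature x 2-barcode matrix (made-up values):
#              barcode 0   barcode 1
#   feature 0      1           0
#   feature 1      0           2
#   feature 2      3           0
data = [1, 3, 2]      # nonzero values, column by column
indices = [0, 2, 1]   # row index of each value in `data`
indptr = [0, 2, 3]    # column c spans data[indptr[c]:indptr[c + 1]]
shape = (3, 2)        # (n_rows, n_columns)

for col in range(shape[1]):
    print(col, column_entries(data, indices, indptr, col))
```

Because each column's values are contiguous, retrieving everything for a single barcode is a single slice, whereas retrieving a single feature across all barcodes requires scanning every column.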
There are two ways to load the H5 matrix into Python:
The first method uses the cellranger.matrix module that ships with Cell Ranger ATAC. It requires that you add cellranger-atac/lib/cellranger/lib/python to your $PYTHONPATH.
For example, if you installed Cell Ranger ATAC into /opt/cellranger-atac-2.1.0, you would call:
$ export PYTHONPATH=/opt/cellranger-atac-2.1.0/cellranger-atac-cs/2.1.0/lib/cellranger/lib/python:$PYTHONPATH
Then in Python, call:
    import cellranger.matrix as cr_matrix

    filtered_matrix_h5 = "/opt/sample345/outs/filtered_peak_bc_matrix.h5"
    peak_matrix = cr_matrix.CountMatrix.load_h5_file(filtered_matrix_h5)
    matrix = peak_matrix.m
The second method is more involved, and requires the SciPy and PyTables libraries.
    import collections

    import scipy.sparse as sp_sparse
    import tables

    FeatureBCMatrix = collections.namedtuple(
        'FeatureBCMatrix', ['ids', 'names', 'barcodes', 'matrix'])


    def get_matrix_from_h5(filename):
        """Load the feature-barcode matrix from a Cell Ranger ATAC H5 file."""
        with tables.open_file(filename, 'r') as f:
            try:
                group = f.get_node(f.root, 'matrix')
            except tables.NoSuchNodeError:
                print("Matrix group does not exist in this file.")
                return None
            # 'features' is a subgroup, so only its child datasets are read.
            feature_group = getattr(group, 'features')
            ids = getattr(feature_group, 'id').read()
            names = getattr(feature_group, 'name').read()
            barcodes = getattr(group, 'barcodes').read()
            data = getattr(group, 'data').read()
            indices = getattr(group, 'indices').read()
            indptr = getattr(group, 'indptr').read()
            shape = getattr(group, 'shape').read()
            # Rebuild the sparse matrix from its CSC components.
            matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)
            return FeatureBCMatrix(ids, names, barcodes, matrix)


    filtered_matrix_h5 = "/opt/sample345/outs/filtered_tf_bc_matrix.h5"
    tf_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)
    matrix = tf_bc_matrix.matrix
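Once the arrays are loaded, the barcode and feature strings often need a little post-processing. The sketch below is one possible way to do that. The helper names to_text, split_barcode, and parse_peak_id are hypothetical and not part of Cell Ranger ATAC. It assumes that string datasets may come back as byte strings under Python 3, that barcodes follow the sequence-gemgroup pattern shown in the table (e.g. AAACGGGCAGCTCGAC-1), and that peak IDs follow the chrom:start-end pattern (e.g. chr1:1000-2000). Motif IDs do not follow that pattern, so parse_peak_id would only apply to the peak-barcode matrix.

```python
def to_text(value):
    """Decode a byte string to str; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def split_barcode(barcode):
    """Split 'SEQUENCE-N' into (sequence, gem_group) with gem_group as int."""
    sequence, _, gem_group = to_text(barcode).rpartition("-")
    return sequence, int(gem_group)


def parse_peak_id(peak_id):
    """Split 'chrom:start-end' into (chrom, start, end) with int coordinates."""
    # rpartition keeps any ':' that might appear inside the contig name.
    chrom, _, span = to_text(peak_id).rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


# Example values taken from the dataset table above.
print(split_barcode(b"AAACGGGCAGCTCGAC-1"))
print(parse_peak_id(b"chr1:1000-2000"))
```

With the second loading method, these helpers could be applied element-wise to the barcodes and ids fields of the returned tuple, for example [split_barcode(b) for b in tf_bc_matrix.barcodes].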