
# HDF5 Feature Barcode Matrix Format

In addition to the MEX format, 10x also provides matrices in the Hierarchical Data Format (HDF5 or H5). H5 is a binary format that compresses and accesses data more efficiently than text formats such as MEX, which is useful when dealing with large datasets. H5 files are supported in both Python and R.

## File Format

The top level of the file contains a single HDF5 group, called matrix, and metadata stored as HDF5 attributes. Within the matrix group are datasets containing the dimensions of the matrix, the matrix entries, as well as the features and spot-barcodes associated with the matrix rows and columns, respectively.

| Column | Type | Description |
| --- | --- | --- |
| barcodes | string | Barcode sequences and their corresponding library identifiers (for example, AAACGGGCAGCTCGAC-1). The library identifier is always -1 for spaceranger count runs of individual capture areas, and a small integer that identifies distinct capture areas in the output of spaceranger aggr. |
| data | uint32 | Nonzero UMI counts in column-major order |
| indices | uint32 | Zero-based row index of corresponding element in data |
| indptr | uint32 | Zero-based index into data / indices of the start of each column, that is, the data corresponding to each barcode sequence |
| shape | uint64 | Tuple of (# rows, # columns) indicating the matrix dimensions |
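
As a small illustration of the barcode convention described above, the sketch below splits an entry into its sequence and library identifier. The helper name `split_barcode` is hypothetical and not part of any 10x library, and the code assumes barcodes may arrive as byte strings, which is common when HDF5 string datasets are read from Python.

```python
def split_barcode(barcode):
    """Split 'AAACGGGCAGCTCGAC-1' into ('AAACGGGCAGCTCGAC', 1)."""
    # HDF5 string datasets are often returned as bytes; decode if needed.
    if isinstance(barcode, bytes):
        barcode = barcode.decode("ascii")
    # Split on the last hyphen so only the trailing library identifier is removed.
    sequence, _, library_id = barcode.rpartition("-")
    return sequence, int(library_id)


print(split_barcode(b"AAACGGGCAGCTCGAC-1"))  # ('AAACGGGCAGCTCGAC', 1)
```

For aggregated output, the second element can be used to group barcodes by capture area.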

The matrix entries are stored in Compressed Sparse Column (CSC) format. For more details on the format, see the SciPy documentation for scipy.sparse.csc_matrix. CSC represents the matrix in column-major order, so that each barcode is represented by a contiguous chunk of data values.
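
To make the roles of data, indices, and indptr concrete, here is a minimal pure-Python sketch on a toy matrix. The values and the helper names `column_entries` and `to_dense` are made up for illustration; real files hold these arrays as HDF5 datasets.

```python
# Toy 3-feature x 4-barcode matrix (illustrative values only):
#
#            bc0  bc1  bc2  bc3
# feature0     1    0    0    2
# feature1     0    0    3    0
# feature2     4    0    5    0
data = [1, 4, 3, 5, 2]      # nonzero counts, stored column by column
indices = [0, 2, 1, 2, 0]   # row (feature) index of each entry in data
indptr = [0, 2, 2, 4, 5]    # column j spans data[indptr[j]:indptr[j + 1]]
shape = (3, 4)              # (# rows, # columns)


def column_entries(j):
    """Return {row_index: count} for column (barcode) j."""
    start, end = indptr[j], indptr[j + 1]
    return dict(zip(indices[start:end], data[start:end]))


def to_dense():
    """Expand the CSC arrays into a dense list of rows."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for i, count in column_entries(j).items():
            dense[i][j] = count
    return dense


print(column_entries(0))  # {0: 1, 2: 4}
print(column_entries(1))  # {} (a barcode with no counts)
print(to_dense())         # [[1, 0, 0, 2], [0, 0, 3, 0], [4, 0, 5, 0]]
```

Because each barcode's entries sit in one contiguous slice, reading a single column touches only that slice rather than the whole matrix.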

The feature reference is stored as an HDF5 group called features, within the matrix group. See the documentation for the Molecule Info HDF5 file for details.

## HDF5 File Hierarchy

    (root)
    └── matrix [HDF5 group]
        ├── barcodes
        ├── data
        ├── indices
        ├── indptr
        ├── shape
        └── features [HDF5 group]
            ├── _all_tag_keys
            ├── feature_type
            ├── genome
            ├── id
            └── name
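
One way to check this hierarchy for a given file is to walk its nodes with PyTables. The sketch below is an assumption-laden illustration: it writes a tiny toy file with the same layout (all values made up) so that it runs without real Space Ranger output, and the helper names `write_toy_matrix_h5` and `list_matrix_nodes` are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables


def write_toy_matrix_h5(filename):
    """Write a tiny file mirroring the hierarchy above (made-up values)."""
    with tables.open_file(filename, 'w') as f:
        matrix = f.create_group(f.root, 'matrix')
        f.create_array(matrix, 'barcodes', obj=np.array([b'AAACGGGCAGCTCGAC-1']))
        f.create_array(matrix, 'data', obj=np.array([7], dtype=np.uint32))
        f.create_array(matrix, 'indices', obj=np.array([0], dtype=np.uint32))
        f.create_array(matrix, 'indptr', obj=np.array([0, 1], dtype=np.uint32))
        f.create_array(matrix, 'shape', obj=np.array([1, 1], dtype=np.uint64))
        features = f.create_group(matrix, 'features')
        f.create_array(features, '_all_tag_keys', obj=np.array([b'genome']))
        f.create_array(features, 'feature_type', obj=np.array([b'Gene Expression']))
        f.create_array(features, 'genome', obj=np.array([b'toy_genome']))
        f.create_array(features, 'id', obj=np.array([b'GENE_A']))
        f.create_array(features, 'name', obj=np.array([b'GeneA']))


def list_matrix_nodes(filename):
    """Return sorted (path, kind) pairs for the nodes under /matrix."""
    nodes = []
    with tables.open_file(filename, 'r') as f:
        for node in f.walk_nodes('/matrix'):
            kind = 'group' if isinstance(node, tables.Group) else 'dataset'
            nodes.append((node._v_pathname, kind))
    return sorted(nodes)


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_feature_bc_matrix.h5')
    write_toy_matrix_h5(toy_path)
    nodes = list_matrix_nodes(toy_path)

for path, kind in nodes:
    print(kind, path)
```

Pointing `list_matrix_nodes` at a real filtered_feature_bc_matrix.h5 should list the same dataset names, though the exact set of entries under features depends on the reference used.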


To load the H5 matrix into R, see the documentation on Secondary Analysis in R.

There are two ways to load the H5 matrix into Python:

### 1. Using cellranger

This method requires adding spaceranger/lib/python to your $PYTHONPATH. For example, if you installed Space Ranger into /opt/spaceranger-1.1.0, you can set your PYTHONPATH by sourcing the following script:

    $ source /opt/spaceranger-1.1.0/sourceme.bash


Then in Python, the matrix can be loaded using the cellranger.matrix module as follows:

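    import cellranger.matrix as cr_matrix

    # Path to the filtered matrix produced by the pipeline.
    filtered_matrix_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
    filtered_feature_bc_matrix = cr_matrix.CountMatrix.load_h5_file(filtered_matrix_h5)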


### 2. Using PyTables

This method is more involved, and requires the SciPy and PyTables libraries.

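    import collections
    import scipy.sparse as sp_sparse
    import tables

    CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])

    def get_matrix_from_h5(filename):
        with tables.open_file(filename, 'r') as f:
            mat_group = f.get_node(f.root, 'matrix')
            # Read the datasets that make up the CSC matrix.
            barcodes = f.get_node(mat_group, 'barcodes').read()
            data = f.get_node(mat_group, 'data').read()
            indices = f.get_node(mat_group, 'indices').read()
            indptr = f.get_node(mat_group, 'indptr').read()
            shape = f.get_node(mat_group, 'shape').read()
            matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)

            # Collect the per-feature annotations from the features group.
            feature_ref = {}
            feature_group = f.get_node(mat_group, 'features')
            for key in ('id', 'name', 'feature_type'):
                feature_ref[key] = f.get_node(feature_group, key).read()

            return CountMatrix(feature_ref, barcodes, matrix)

    filtered_matrix_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
    filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)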
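
Once loaded, the CountMatrix tuple can be queried with ordinary SciPy and NumPy operations. The following is a sketch only: it builds a toy stand-in with made-up values rather than reading a real file, and the helper names `umi_counts_per_barcode` and `counts_for_feature` are hypothetical.

```python
import collections

import numpy as np
import scipy.sparse as sp_sparse

CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])

# Toy stand-in for the object returned by get_matrix_from_h5 (made-up values).
toy = CountMatrix(
    feature_ref={'id': np.array([b'GENE_A', b'GENE_B', b'GENE_C']),
                 'name': np.array([b'GeneA', b'GeneB', b'GeneC'])},
    barcodes=np.array([b'AAAC-1', b'AAAG-1', b'AAAT-1', b'AACA-1']),
    matrix=sp_sparse.csc_matrix(
        (np.array([1, 4, 3, 5, 2], dtype=np.uint32),  # data
         np.array([0, 2, 1, 2, 0]),                   # indices
         np.array([0, 2, 2, 4, 5])),                  # indptr
        shape=(3, 4)),
)


def umi_counts_per_barcode(count_matrix):
    """Map each barcode to its total UMI count (column sums)."""
    totals = np.asarray(count_matrix.matrix.sum(axis=0)).ravel()
    return {bc.decode(): int(n) for bc, n in zip(count_matrix.barcodes, totals)}


def counts_for_feature(count_matrix, feature_name):
    """Return the dense count vector across barcodes for one feature name."""
    names = [n.decode() for n in count_matrix.feature_ref['name']]
    row = names.index(feature_name)
    return count_matrix.matrix[row, :].toarray().ravel()


print(umi_counts_per_barcode(toy))
print(counts_for_feature(toy, 'GeneC'))
```

Column sums are cheap in CSC because each barcode's counts are contiguous; row lookups such as `counts_for_feature` are slower, so converting with `.tocsr()` may be worthwhile when many per-feature queries are needed.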