10x Genomics
Chromium Single Cell Gene Expression

Molecule Info

The cellranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and valid UMI and were assigned with high confidence to a gene or Feature Barcode. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, feature set(s), and barcode lists used for the analysis.

Per-Molecule Columns

The following HDF5 datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (UMI, cell-barcode, feature) tuple indicating the feature best supported by the reads (i.e., including PCR duplicates) assigned to that UMI and cell-barcode.

| Column | Type | Description |
| --- | --- | --- |
| barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the cell-barcode assigned to this putative molecule. |
| count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
| feature_idx | uint32 | A zero-based index into the feature list (see next section), indicating the feature to which this putative molecule was assigned. |
| gem_group | uint16 | Integer label that distinguishes data coming from distinct 10x GEM reactions (such as different channels or chips). |
| library_idx | uint16 | A zero-based index into the library_info array (see next section) that distinguishes data coming from distinct 10x libraries (for example, gene expression and Feature Barcode). There may be multiple libraries associated with a single GEM well. |
| umi | uint32 | 2-bit encoded (see note below) processed (i.e., corrected) UMI sequence. |
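
As a rough illustration of how these parallel columns fit together, the sketch below treats each column as a plain Python list and tallies molecules and reads per cell-barcode index. The function name `summarize_by_barcode` and the sample values are hypothetical and made up for illustration; with a real file, the columns would be read from the HDF5 datasets of the same names.

```python
from collections import defaultdict


def summarize_by_barcode(barcode_idx, count):
    """Return {barcode_idx: (n_molecules, n_reads)} from two parallel columns.

    Position i in each column describes one putative molecule:
    barcode_idx[i] is its cell-barcode index and count[i] is the number
    of reads supporting it.
    """
    if len(barcode_idx) != len(count):
        raise ValueError("columns must have the same length")
    molecules = defaultdict(int)
    reads = defaultdict(int)
    for b, c in zip(barcode_idx, count):
        molecules[b] += 1  # one row is one (UMI, cell-barcode, feature) tuple
        reads[b] += c      # read counts include PCR duplicates
    return {b: (molecules[b], reads[b]) for b in molecules}


# Made-up example columns (not real pipeline output).
barcode_idx = [0, 0, 2, 2, 2]
count = [3, 1, 5, 2, 1]
summary = summarize_by_barcode(barcode_idx, count)
print(summary)
```

The number of rows per barcode is a UMI count, while the summed count column is a read count; keeping the two apart is the main reason to carry both values.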

Reference Columns

In addition, the molecule info file has datasets corresponding to information about the libraries, barcode list(s), and feature set(s) that were used in the analysis.

Experiment Reference

At the top level of the HDF5 file hierarchy, the barcodes and library_info datasets provide information about the experiments contained in this analysis:

| Dataset | Type | Description |
| --- | --- | --- |
| barcodes | string | A list of all cell-barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. To distinguish between identical cell-barcode sequences observed in different GEM reactions, the GEM well is appended to the end of the cell-barcode sequence (e.g., AGAATGGTCTGCAT-1). |
| library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. |
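
The two datasets above can be unpacked with the standard library alone. The following is a minimal sketch, assuming the GEM well suffix is always a decimal integer after the last hyphen and that the position of an object in the library_info array is the value used by library_idx. The helper names `split_barcode` and `libraries_by_type` and the example JSON are hypothetical; real files may carry additional metadata keys.

```python
import json


def split_barcode(barcode):
    """Split 'AGAATGGTCTGCAT-1' into ('AGAATGGTCTGCAT', 1)."""
    sequence, _, gem_well = barcode.rpartition("-")
    if not sequence or not gem_well.isdigit():
        raise ValueError(f"unexpected barcode format: {barcode!r}")
    return sequence, int(gem_well)


def libraries_by_type(library_info_json):
    """Group array positions (usable as library_idx) by library_type."""
    libraries = json.loads(library_info_json)
    grouped = {}
    for idx, lib in enumerate(libraries):
        grouped.setdefault(lib["library_type"], []).append(idx)
    return grouped


# Hypothetical library_info content, for illustration only.
example = json.dumps([
    {"library_id": 0, "library_type": "Gene Expression", "gem_group": 1},
    {"library_id": 1, "library_type": "Antibody Capture", "gem_group": 1},
])
print(split_barcode("AGAATGGTCTGCAT-1"))
print(libraries_by_type(example))
```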

Observed Cell-Barcodes

The HDF5 group barcode_info gives information regarding the barcodes that were called as cells during the analysis. This HDF5 group contains two datasets:

| Dataset | Type | Description |
| --- | --- | --- |
| genomes | string | A list of all genome references used for gene expression libraries in this analysis. |
| pass_filter | uint64 | A matrix with three columns that contains one row per passing cell-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. For Feature Barcode libraries, genome_idx will correspond to the genome reference used for the gene expression data from the specified cell-barcode. |
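
One common use of pass_filter is to restrict the per-molecule table to cell-called barcodes. The sketch below does this with plain Python sequences; the function names and sample rows are hypothetical, and for simplicity it matches on barcode_idx only, ignoring library_idx and genome_idx.

```python
def cell_barcode_indices(pass_filter):
    """Collect the barcode_idx values of barcodes that were called as cells.

    pass_filter is a sequence of (barcode_idx, library_idx, genome_idx) rows.
    """
    return {row[0] for row in pass_filter}


def keep_cell_molecules(barcode_idx, pass_filter):
    """Return positions of molecules whose barcode was called as a cell."""
    cells = cell_barcode_indices(pass_filter)
    return [i for i, b in enumerate(barcode_idx) if b in cells]


# Made-up data (not real pipeline output).
pass_filter = [(0, 0, 0), (2, 0, 0)]
barcode_idx = [0, 1, 1, 2, 5]
kept = keep_cell_molecules(barcode_idx, pass_filter)
print(kept)
```

A stricter variant would match on the (barcode_idx, library_idx) pair, which may matter when several libraries share a GEM well.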

Feature Reference

The HDF5 group features contains information regarding the feature reference(s) used for the analysis. The datasets within the features group represent columns in a table containing one row per feature. Values in the feature_idx column described in the Per-Molecule Columns section provide indices into the rows of this hypothetical table.

In addition to the columns described below, user-specified tags may also be present. The dataset _all_tag_keys contains a list of user-specified tags as well as built-in tags (genome, pattern, read, and/or sequence).

| Column | Type | Description |
| --- | --- | --- |
| feature_type | string | The type of feature reference to which this feature belongs (Gene Expression, CRISPR Guide Capture, Antibody Capture, or Custom). |
| genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). For non-gene expression features, this entry is an empty string. |
| id | string | The unique id corresponding to this feature (for example, an Ensembl gene ID). |
| name | string | A human-readable name associated with this feature (for example, the common name associated with a gene). |
| pattern | string | [Feature Barcode only] Specifies how to extract the Feature Barcode sequence from the read. |
| read | string | [Feature Barcode only] Specifies which RNA sequencing read ("R1" or "R2") contains the Feature Barcode. |
| sequence | string | [Feature Barcode only] Nucleotide barcode sequence associated with this feature (e.g., a sgRNA protospacer sequence). |

The features group also contains an HDF5 group target_sets used for Targeted Gene Expression samples. When a target gene panel is present, indices of the target genes are stored inside target_sets, in an HDF5 dataset named after the target gene panel (e.g., "Human Gene Signature").
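
Because the features group is stored column by column, looking up one feature means reading the same position from each dataset. The sketch below illustrates that lookup, plus a filter for on-target molecules, using in-memory lists. It assumes that the indices stored in a target set refer to rows of the feature table; the helper names, the antibody feature id, and the sample values are hypothetical.

```python
def feature_record(features, feature_idx):
    """Assemble one row of the feature table from column-oriented data."""
    return {column: values[feature_idx] for column, values in features.items()}


def molecules_on_target(feature_idx_column, target_indices):
    """Return positions of molecules assigned to a feature in the target set."""
    targets = set(target_indices)
    return [i for i, f in enumerate(feature_idx_column) if f in targets]


# Illustrative stand-in for the datasets in the features group.
features = {
    "id": ["ENSG00000141510", "ENSG00000146648", "CD3_example"],
    "name": ["TP53", "EGFR", "CD3"],
    "feature_type": ["Gene Expression", "Gene Expression", "Antibody Capture"],
    "genome": ["GRCh38", "GRCh38", ""],
}
record = feature_record(features, 1)
on_target = molecules_on_target([0, 2, 1, 1, 0], target_indices=[1])
print(record)
print(on_target)
```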

HDF5 File Hierarchy

(root)
├─ barcode_idx
├─ barcode_info	[HDF5 group]
│   ├─ genomes
│   └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features	[HDF5 group]
│   ├─ _all_tag_keys
│   ├─ target_sets [for Targeted Gene Expression]
│   │    └─ [target set name]
│   ├─ feature_type
│   ├─ genome
│   ├─ id
│   ├─ name
│   ├─ pattern [Feature Barcode only]
│   ├─ read [Feature Barcode only]
│   └─ sequence [Feature Barcode only]
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics_json [HDF5 dataset; see below]
└─ umi
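
A quick sanity check against the hierarchy above can catch truncated or unexpected files early. The following sketch compares a list of top-level names with the entries shown in the tree. The helper names are hypothetical, and `top_level_names` relies on the third-party h5py package, which is an assumption about the reader's environment rather than anything Cell Ranger requires.

```python
EXPECTED_TOP_LEVEL = {
    "barcode_idx", "barcode_info", "barcodes", "count", "feature_idx",
    "features", "gem_group", "library_idx", "library_info",
    "metrics_json", "umi",
}


def missing_entries(found_names):
    """Return the expected top-level names absent from found_names, sorted."""
    return sorted(EXPECTED_TOP_LEVEL - set(found_names))


def top_level_names(path):
    """List the top-level groups and datasets of an HDF5 file."""
    import h5py  # third-party; imported lazily so missing_entries works without it

    with h5py.File(path, "r") as handle:
        return list(handle.keys())


# Example with a made-up listing that lacks the umi dataset.
print(missing_entries(EXPECTED_TOP_LEVEL - {"umi"}))
```

With a real file, `missing_entries(top_level_names(path))` would be expected to return an empty list, where `path` points at the molecule info file.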

2-bit Encoding

The UMI sequences are 2-bit encoded as follows:

Note that the cell-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes HDF5 dataset.
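
The encoding rules themselves are not listed above, so the following is only a minimal sketch under stated assumptions: each pair of bits encodes one nucleotide with 0 = A, 1 = C, 2 = G, 3 = T, and the least significant bits hold the 3'-most nucleotide. The UMI length is not stored in the encoded value (leading A bases are zero bits), so it is passed in as the hypothetical parameter `umi_length`. The function names are also hypothetical.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: 0 = A, 1 = C, 2 = G, 3 = T


def decode_umi(encoded, umi_length):
    """Decode a 2-bit packed integer into a nucleotide string."""
    encoded = int(encoded)
    bases = []
    for _ in range(umi_length):
        bases.append(NUCLEOTIDES[encoded & 0b11])  # lowest 2 bits: 3'-most base
        encoded >>= 2
    return "".join(reversed(bases))  # restore 5'-to-3' order


def encode_umi(sequence):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    encoded = 0
    for base in sequence:
        encoded = (encoded << 2) | NUCLEOTIDES.index(base)
    return encoded


packed = encode_umi("ACGT")
print(packed, decode_umi(packed, 4))
```

A uint32 value has room for at most 16 bases under this scheme, which is consistent with the umi column type given earlier.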

Metrics JSON HDF5 Dataset

The metrics_json dataset contains pipeline metrics in JSON format that are used internally by Cell Ranger. Users should view metrics using the Cell Ranger metrics outputs.