# Molecule Info

The spaceranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and valid UMI and were assigned with high confidence to a gene. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, feature set(s), and barcode lists used for the analysis.

## Per-Molecule Columns

The following HDF5 datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (UMI, spot-barcode, feature) tuple indicating the feature best supported by the reads (i.e., including PCR duplicates) assigned to that UMI and spot-barcode. (If two or more features are tied for the number of supporting reads, as may happen for genes with very low mappability, then one row is output for each of the tied features.)

| Column | Type | Description |
| --- | --- | --- |
| barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the spot-barcode assigned to this putative molecule. |
| count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
| feature_idx | uint32 | A zero-based index into the feature list (see next section), indicating the feature to which this putative molecule was assigned. |
| gem_group | uint16 | Integer label that is currently one (1) for all Space Ranger output. |
| library_idx | uint16 | A zero-based index into the library_info array (see next section) that distinguishes data coming from distinct Visium libraries. |
| umi | uint32 | 2-bit encoded (see note below) processed (i.e., corrected) UMI sequence. |
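
As a rough illustration of how these index columns fit together, the sketch below resolves one row of the table into a barcode string and a feature ID. It assumes `mol` is any mapping from dataset path to array-like values laid out like the molecule info file; `resolve_molecule` and the toy values are hypothetical and exist only for this example.

```python
def resolve_molecule(mol, row):
    """Return (barcode, feature_id, umi, count) for one row of the table.

    `mol` maps dataset paths (e.g. "barcode_idx", "features/id") to
    array-like values. This layout is an assumption of the sketch.
    """
    barcode = mol["barcodes"][mol["barcode_idx"][row]]        # index -> barcode string
    feature_id = mol["features/id"][mol["feature_idx"][row]]  # index -> feature ID
    return barcode, feature_id, mol["umi"][row], mol["count"][row]


# Toy stand-in for a molecule info file (all values invented).
toy = {
    "barcodes": ["AAAC-1", "AAAG-1", "AAAT-1"],
    "features/id": ["GENE_A", "GENE_B"],
    "barcode_idx": [0, 2, 2],
    "feature_idx": [1, 0, 1],
    "umi": [27, 5, 9],
    "count": [3, 1, 7],
}

print(resolve_molecule(toy, 1))  # ('AAAT-1', 'GENE_A', 5, 1)
```

An HDF5 reader that supports path-style indexing (for example an `h5py.File` opened read-only) could likely be passed in place of the dictionary, although string datasets may then come back as bytes and need decoding.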

## Reference Columns

In addition, the molecule info file has datasets corresponding to information about the libraries, barcode list(s), and feature set(s) that were used in the analysis.

### Experiment Reference

At the top level of the HDF5 file hierarchy, the barcodes and library_info datasets provide information about the experiments contained in this analysis:

| Dataset | Type | Description |
| --- | --- | --- |
| barcodes | string | A list of all spot-barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. Each spot-barcode sequence has a trailing digit that is currently one (1) in output generated from Space Ranger (e.g., AGAATGGTCTGCAT-1). |
| library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. |
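
Because library_info is a JSON-formatted string, it can be parsed with any JSON library once it has been read from the file. The sketch below assumes the raw value arrives as `str` or `bytes`; `parse_library_info` is a hypothetical helper, and the example metadata values are invented.

```python
import json


def parse_library_info(raw):
    """Parse the library_info JSON array into a list of dicts.

    The position of each dict in the list is the value that the
    library_idx column refers to.
    """
    if isinstance(raw, bytes):  # HDF5 readers often return bytes
        raw = raw.decode("utf-8")
    return json.loads(raw)


# Invented example value; real files may carry additional keys.
example = b'[{"library_id": 0, "library_type": "Spatial Gene Expression", "gem_group": 1}]'

libraries = parse_library_info(example)
print(libraries[0]["library_type"])  # Spatial Gene Expression
```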

### Observed Spot-Barcodes

The HDF5 group barcode_info describes the spot-barcodes that were detected as under the tissue during the analysis. This group contains two datasets:

| Dataset | Type | Description |
| --- | --- | --- |
| genomes | string | A list of all genome references used for gene expression libraries in this analysis. |
| pass_filter | uint8 | A matrix with three columns that contains one row per passing spot-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. |
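
One plausible use of pass_filter is to restrict the per-molecule table to spots under the tissue. The sketch below treats a molecule as passing when its (barcode_idx, library_idx) pair appears in pass_filter; that matching rule, the helper name `rows_under_tissue`, and the sample values are assumptions made for illustration.

```python
def rows_under_tissue(barcode_idx, library_idx, pass_filter):
    """Return row indices of molecules whose spot-barcode passed the filter.

    `pass_filter` is an iterable of (barcode_idx, library_idx, genome_idx)
    rows; the genome index is ignored here.
    """
    passing = {(int(b), int(l)) for b, l, _genome in pass_filter}
    return [
        row
        for row, (b, l) in enumerate(zip(barcode_idx, library_idx))
        if (int(b), int(l)) in passing
    ]


# Invented values: only the barcode with index 2 in library 0 is under tissue.
kept = rows_under_tissue(
    barcode_idx=[0, 2, 2, 1],
    library_idx=[0, 0, 0, 0],
    pass_filter=[(2, 0, 0)],
)
print(kept)  # [1, 2]
```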

## HDF5 File Hierarchy

(root)
|
├─ barcode_idx
├─ barcode_info [HDF5 group]
│   ├─ genomes
│   └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features [HDF5 group]
│   ├─ _all_tag_keys
│   ├─ feature_type
│   ├─ genome
│   ├─ id
│   └─ name
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics [HDF5 group; see below]
└─ umi


## 2-bit Encoding

The UMI sequences are 2-bit encoded as follows:

• Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
• The least significant byte (LSB) contains the 3'-most nucleotides.

Note that the spot-barcode sequences are not encoded this way; they are stored as plain strings in the barcodes dataset.
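
The encoding rules above can be sketched as a pair of helper functions. The encoded integer does not record the UMI length, so this sketch takes it as a parameter that the caller must know from the assay chemistry; `decode_umi` and `encode_umi` are hypothetical names, and the 4-nucleotide sequence is used only to keep the example short.

```python
NUCLEOTIDES = "ACGT"  # 2-bit code 0, 1, 2, 3 -> A, C, G, T


def decode_umi(encoded, length):
    """Decode a 2-bit encoded UMI of `length` nucleotides into a string."""
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[encoded & 0b11])  # lowest two bits = 3'-most base
        encoded >>= 2
    return "".join(reversed(bases))  # restore 5' -> 3' order


def encode_umi(seq):
    """Inverse of decode_umi: pack a nucleotide string into an integer."""
    value = 0
    for base in seq:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value


print(encode_umi("ACGT"))  # 27
print(decode_umi(27, 4))   # ACGT
```

Passing the wrong length silently pads or truncates the sequence at the 5' end, so the value is worth checking against the chemistry used for the run.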

## Metrics HDF5 Group

The metrics group is intended for internal use by the Space Ranger pipeline; users should view metrics using the Space Ranger metrics outputs.

The attributes of the metrics group contain pipeline metrics stored as serialized Python objects (using cPickle).

