# Molecule info

The cellranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid cell-barcode and valid UMI. This file is required by the R kit in order to produce read-subsampled gene-barcode matrices.

## Molecule info columns

Each dataset in the molecule info file corresponds to a single column. Each row corresponds to a unique (cell-barcode, UMI, gene) tuple. There is an additional row per (cell-barcode, UMI) tuple that aggregates information about reads that could not be confidently mapped to a gene.

| Column | Type | Description |
| --- | --- | --- |
| barcode | uint64 | 2-bit encoded processed cell-barcode sequence. |
| gem_group | uint8 | When a sample is split across multiple channels, the GEM group identifies which channel a barcode came from. |
| gene | uint32 | An integer corresponding to the gene this putative molecule mapped to. This is a zero-based index into the genes.tsv file that accompanies the gene-barcode matrices. When set to the maximum gene index + 1, this row describes reads that did not map confidently to any gene. |
| umi | uint32 | 2-bit encoded processed UMI sequence. |
| reads | uint32 | Number of reads that confidently mapped to this putative molecule. |
| nonconf_mapped_reads | uint32 | The number of reads with this cell-barcode and UMI that mapped to the genome but did not map confidently to any gene. |
| unmapped_reads | uint32 | The number of reads with this cell-barcode and UMI that did not map to the genome. |
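
As a rough illustration of how the `gene` column might be interpreted, the sketch below counts putative molecules per cell-barcode and skips the aggregate rows whose `gene` value equals the maximum gene index + 1. The function name `count_molecules_per_barcode`, the parameter `num_genes`, and the toy values are assumptions made for this example. They are not part of the pipeline output, and the columns are assumed to have already been loaded into plain Python sequences.

```python
from collections import Counter


def count_molecules_per_barcode(barcode, gene, num_genes):
    """Count (cell-barcode, UMI, gene) rows per barcode.

    `num_genes` is assumed to be the number of genes in the reference,
    i.e. the maximum gene index + 1. Rows whose gene value equals it
    describe reads that did not map confidently to any gene and are skipped.
    """
    counts = Counter()
    for bc, g in zip(barcode, gene):
        if g == num_genes:
            continue  # aggregate row, not a putative molecule
        counts[bc] += 1
    return counts


# Toy columns, invented for illustration only (not real pipeline output).
barcode = [7, 7, 7, 9]
gene = [0, 2, 3, 1]
num_genes = 3  # valid gene indices are 0, 1, 2; the value 3 marks an aggregate row

counts = count_molecules_per_barcode(barcode, gene, num_genes)
```

With these toy values, the third row is treated as an aggregate row, so barcode `7` contributes two molecules and barcode `9` contributes one.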

## 2-bit encoding

The cell-barcode and UMI sequences are 2-bit encoded as follows:

• Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
• The least significant byte (LSB) contains the 3'-most nucleotides.
• The most significant bit is set if the sequence contained an 'N'.
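
The rules above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's own code. The function names `encode_2bit` and `decode_2bit` and the parameters `length` and `bit_width` are assumptions made for this example. The sequence length has to be supplied by the caller because the encoded integer does not record it (leading "A" nucleotides are all zero bits). The value of `bit_width` follows the column type: 64 for `barcode` and 32 for `umi`. How the remaining bits are populated when the 'N' flag is set is not described here, so the sketch does not try to decode such values.

```python
NUCLEOTIDES = "ACGT"  # 0="A", 1="C", 2="G", 3="T"


def encode_2bit(seq):
    """Encode an N-free sequence; the 3'-most nucleotide lands in the lowest bits."""
    value = 0
    for base in seq:
        if base not in NUCLEOTIDES:
            # Assumption: only the flag bit is documented for 'N', so refuse here.
            raise ValueError("only A, C, G and T can be encoded by this sketch")
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value


def decode_2bit(value, length, bit_width=64):
    """Decode `length` nucleotides, or return None if the 'N' flag is set."""
    if (value >> (bit_width - 1)) & 1:
        return None  # most significant bit set: the sequence contained an 'N'
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[value & 0b11])  # lowest pair = 3'-most base
        value >>= 2
    return "".join(reversed(bases))  # restore 5' to 3' order
```

For example, `encode_2bit("ACGT")` packs the pairs `00 01 10 11` into the integer 27, and `decode_2bit(27, 4)` recovers `"ACGT"`.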