# Molecule info

The cellranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid cell-barcode and valid UMI. This file is required by the R kit in order to produce read-subsampled gene-barcode matrices. This HDF5 file contains data corresponding to the observed molecules, as well as data corresponding to the reference transcriptome that was used.

## Molecule info columns

The following datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (cell-barcode, UMI, gene) tuple. There is an additional row per (cell-barcode, UMI) tuple that aggregates information about reads that could not be confidently mapped to a gene.

| Column | Type | Description |
| --- | --- | --- |
| barcode | uint64 | 2-bit encoded processed cell-barcode sequence. |
| barcode_corrected_reads | uint32 | Number of reads within this putative molecule that had their cell-barcode corrected. |
| conf_mapped_uniq_read_pos | uint32 | Number of unique read mapping positions associated with this putative molecule. |
| gem_group | uint8 | Integer label that distinguishes data coming from distinct 10x GEM reactions (such as different channels or chips). |
| gene | uint32 | A zero-based index into the gene_ids field (see next section), indicating the gene to which this putative molecule was mapped. When set to the maximum gene index + 1, this row describes reads that did not map confidently to any gene. |
| genome | uint32 | A zero-based index into the genome_ids field (see next section), indicating the genome to which this putative molecule was mapped. When set to the maximum genome index + 1, this row describes reads that did not map confidently to any genome. |
| nonconf_mapped_reads | uint32 | The number of reads with this cell-barcode and UMI that mapped to the genome but did not map confidently to any gene. |
| reads | uint32 | Number of reads that confidently mapped to this putative molecule. |
| umi | uint32 | 2-bit encoded processed UMI sequence. |
| umi_corrected_reads | uint32 | Number of reads within this putative molecule that had their UMI corrected. |
| unmapped_reads | uint32 | The number of reads with this cell-barcode and UMI that did not map to the genome. |
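
Because each dataset is one column of the same table, row i is formed by taking element i of every dataset. As a minimal sketch, assuming the columns have already been loaded into plain Python lists, the following counts rows (putative molecules) per (gem_group, barcode, gene) and skips the aggregate rows whose gene index is the "no confident gene" value. The helper name `count_molecules` and the sample values are hypothetical and are not part of cellranger.

```python
from collections import Counter

def count_molecules(barcode, gem_group, gene, num_genes):
    """Count table rows per (gem_group, barcode, gene).

    Assumption: a gene index equal to num_genes (maximum gene index + 1)
    marks the aggregate row for reads that did not map confidently to any
    gene, so those rows are skipped.
    """
    counts = Counter()
    for bc, gg, g in zip(barcode, gem_group, gene):
        if g == num_genes:
            continue  # aggregate row, not a gene-level molecule
        counts[(gg, bc, g)] += 1
    return counts

# Hypothetical sample columns: four rows, two genes in the reference.
barcode = [27, 27, 27, 6]
gem_group = [1, 1, 1, 1]
gene = [0, 0, 2, 1]  # the value 2 == num_genes marks an aggregate row

counts = count_molecules(barcode, gem_group, gene, num_genes=2)
for key, n in sorted(counts.items()):
    print(key, n)
```

Keeping gem_group in the key is a deliberate choice in this sketch: the same barcode sequence seen in two different GEM reactions should not be merged into one cell.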

## Molecule reference columns

In addition, the molecule info has a few datasets corresponding to the reference transcriptome(s) associated with this analysis.

| Column | Type | Description |
| --- | --- | --- |
| gene_ids | string | The Ensembl gene IDs contained in this reference. The gene column defined in the previous section is an index into this array. |
| gene_names | string | The common gene symbol associated with each of the above gene_ids. |
| genome_ids | string | The list of genomes represented in this reference. In most cases, this will be a single genome. The genome column defined in the previous section is an index into this array. |
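
The index lookup described above can be sketched as follows. This is an illustration only: `resolve_gene` is a hypothetical helper, and the IDs and symbols are placeholders rather than real reference entries. It relies on the fact that "maximum gene index + 1" equals the length of gene_ids.

```python
def resolve_gene(gene_index, gene_ids, gene_names):
    """Map a value from the gene column to (gene_id, gene_name).

    Returns None when gene_index equals len(gene_ids), i.e. the maximum
    gene index + 1, which marks reads not confidently mapped to any gene.
    """
    if gene_index == len(gene_ids):
        return None
    return gene_ids[gene_index], gene_names[gene_index]

# Placeholder reference arrays (not real Ensembl IDs or symbols).
gene_ids = ["GENE_ID_0", "GENE_ID_1"]
gene_names = ["SYMBOL_0", "SYMBOL_1"]

print(resolve_gene(1, gene_ids, gene_names))
print(resolve_gene(2, gene_ids, gene_names))
```

The same pattern would apply to the genome column and genome_ids.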

## 2-bit encoding

The cell-barcode and UMI sequences are 2-bit encoded as follows:

- Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
- The least significant byte (LSB) contains the 3'-most nucleotides.
- The most significant bit is set if the sequence contained an 'N'.
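
The rules above can be sketched in Python as shown below. This is a hedged illustration, not cellranger's own code: `encode_2bit` and `decode_2bit` are hypothetical helpers. Two assumptions are made. First, the sequence length must be supplied by the caller, because the encoded integer does not carry it (leading "A" nucleotides are all zero bits). Second, when the 'N' flag is set, the sketch returns None instead of guessing how the remaining bits are laid out. The `bit_width` argument would be 64 for the barcode column (uint64) and 32 for the umi column (uint32).

```python
NUCLEOTIDES = "ACGT"  # 0="A", 1="C", 2="G", 3="T"

def encode_2bit(seq):
    """Pack an N-free sequence so the 3'-most base lands in the lowest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value

def decode_2bit(value, length, bit_width=64):
    """Unpack `length` nucleotides from a 2-bit encoded integer.

    Returns None if the most significant bit is set, which signals that
    the original sequence contained an 'N'.
    """
    if (value >> (bit_width - 1)) & 1:
        return None
    bases = []
    for i in range(length):
        # Pair i, counted from the least significant end, is the i-th
        # nucleotide counted from the 3' end of the sequence.
        bases.append(NUCLEOTIDES[(value >> (2 * i)) & 0b11])
    return "".join(reversed(bases))

encoded = encode_2bit("ACGT")
print(encoded, decode_2bit(encoded, length=4))
```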