10x Genomics
Chromium Single Cell Gene Expression

Cell Ranger 1.2, printed on 04/01/2025

Molecule info

The cellranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid cell-barcode and a valid UMI. The Cell Ranger R Kit requires this file to produce read-subsampled gene-barcode matrices. The file contains data for the observed molecules as well as data describing the reference transcriptome that was used.

Molecule info columns

The following datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (cell-barcode, UMI, gene) tuple. There is an additional row per (cell-barcode, UMI) tuple that aggregates information about reads that could not be confidently mapped to a gene.

Column	Type	Description
`barcode`	uint64	2-bit encoded processed cell-barcode sequence.
`barcode_corrected_reads`	uint32	Number of reads within this putative molecule that had their cell-barcode corrected.
`conf_mapped_uniq_read_pos`	uint32	Number of unique read mapping positions associated with this putative molecule.
`gem_group`	uint8	Integer label that distinguishes data coming from distinct 10x GEM reactions (such as different channels or chips).
`gene`	uint32	A zero-based index into the `gene_ids` field (see next section), indicating the gene to which this putative molecule was mapped. When set to the maximum gene index + 1, this row describes reads that did not map confidently to any gene.
`genome`	uint32	A zero-based index into the `genome_ids` field (see next section), indicating the genome to which this putative molecule was mapped. When set to the maximum genome index + 1, this row describes reads that did not map confidently to any genome.
`nonconf_mapped_reads`	uint32	The number of reads with this cell-barcode and UMI that mapped to the genome but did not map confidently to any gene.
`reads`	uint32	Number of reads that confidently mapped to this putative molecule.
`umi`	uint32	2-bit encoded processed UMI sequence.
`umi_corrected_reads`	uint32	Number of reads within this putative molecule that had their UMI corrected.
`unmapped_reads`	uint32	The number of reads with this cell-barcode and UMI that did not map to the genome.
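
Because each row is a unique (cell-barcode, UMI, gene) tuple, counting rows per cell-barcode and gene gives a rough molecule count. The following is a minimal sketch under stated assumptions, not the pipeline's own method: the function name `count_molecules` and the toy arrays are hypothetical, the columns are assumed to have already been loaded as equal-length sequences, and no cell calling or other filtering is applied. Rows whose `gene` equals the number of genes (the "maximum gene index + 1" sentinel) are skipped.

```python
from collections import Counter


def count_molecules(barcodes, gem_groups, genes, n_genes):
    """Count rows (putative molecules) per (barcode, gem_group, gene).

    Hypothetical helper: rows whose gene index equals n_genes are the
    aggregate rows for reads not confidently mapped to a gene, so they
    are skipped.
    """
    counts = Counter()
    for barcode, gem_group, gene in zip(barcodes, gem_groups, genes):
        if gene >= n_genes:
            continue  # sentinel row: not confidently mapped to any gene
        counts[(barcode, gem_group, gene)] += 1
    return counts


# Toy columns (made-up values, not real data); the reference has 2 genes.
barcodes = [5, 5, 5, 9]
gem_groups = [1, 1, 1, 1]
genes = [0, 0, 2, 1]  # the third row uses the sentinel value 2

counts = count_molecules(barcodes, gem_groups, genes, n_genes=2)
```

Keying on `gem_group` as well as `barcode` keeps identical barcode sequences from distinct GEM reactions apart.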

Molecule reference columns

In addition, the molecule info file has a few datasets that describe the reference transcriptome(s) associated with this analysis.

Column	Type	Description
`gene_ids`	string	The Ensembl gene IDs contained in this reference. The `gene` column defined in the previous section is an index into this array.
`gene_names`	string	The common gene symbol associated with each of the above `gene_ids`.
`genome_ids`	string	The list of genomes represented in this reference. In most cases, this will be a single genome. The `genome` column defined in the previous section is an index into this array.
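
The index relationship between the `gene` column and the `gene_ids` and `gene_names` arrays can be sketched as a lookup that treats the "maximum gene index + 1" value as a sentinel. This is an illustrative sketch only: `resolve_gene` is a hypothetical helper, and the two-gene reference below is a toy example rather than the contents of any real molecule info file.

```python
def resolve_gene(gene_index, gene_ids, gene_names):
    """Map a `gene` column value to (gene_id, gene_name).

    Returns None for the sentinel value len(gene_ids), which marks rows
    describing reads that did not map confidently to any gene.
    """
    n_genes = len(gene_ids)
    if gene_index == n_genes:
        return None  # sentinel: maximum gene index + 1
    if not 0 <= gene_index < n_genes:
        raise IndexError(f"gene index {gene_index} out of range")
    return gene_ids[gene_index], gene_names[gene_index]


# Toy reference arrays (illustrative only).
gene_ids = ["ENSG00000141510", "ENSG00000012048"]
gene_names = ["TP53", "BRCA1"]

first = resolve_gene(0, gene_ids, gene_names)
unmapped = resolve_gene(2, gene_ids, gene_names)
```

The same pattern would apply to the `genome` column and the `genome_ids` array.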

2-bit encoding

The cell-barcode and UMI sequences are 2-bit encoded as follows:

Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
The least significant byte (LSB) contains the 3'-most nucleotides.
The most significant bit is set if the sequence contained an 'N'.
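
The rules above can be sketched in Python as follows. This is a sketch under assumptions rather than the pipeline's own code: the names `encode_2bit` and `decode_2bit` are hypothetical, the sequence length is assumed to be known by the caller (the encoded integer does not carry it), and "most significant bit" is taken to mean the top bit of the storage type (bit 63 for the uint64 `barcode`, bit 31 for the uint32 `umi`).

```python
BASES = "ACGT"  # 0="A", 1="C", 2="G", 3="T"


def encode_2bit(seq):
    """Encode an N-free sequence; the 3'-most base lands in the lowest bits."""
    value = 0
    for base in seq:
        # BASES.index raises ValueError for anything other than A, C, G, T.
        value = (value << 2) | BASES.index(base)
    return value


def decode_2bit(value, length, width=64):
    """Decode `length` bases from `value`.

    Returns None when the top bit of a `width`-bit integer is set, which
    is assumed to flag a sequence that contained an 'N'.
    """
    if (value >> (width - 1)) & 1:
        return None
    bases = []
    for i in range(length):
        # Bit pair i (counted from the low end) is the i-th base from the 3' end.
        bases.append(BASES[(value >> (2 * i)) & 3])
    return "".join(reversed(bases))


encoded = encode_2bit("ACGT")
decoded = decode_2bit(encoded, 4)
```

Here `encoded` is 27 (binary 00 01 10 11) and `decoded` is "ACGT" again. For a `umi` value, pass `width=32` so that the flag bit matches the column type.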
