Cell Ranger DNA 1.1 (latest), printed on 01/31/2023
cellranger-dna produces an HDF5 file that contains all of the key outputs of the pipeline. HDF5 files provide an easy way to store compressed data and can be read by tools like h5py. The general structure is similar to a dictionary, with a series of keys storing values. Each of the sections below represents a key in the HDF5 file.
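As a minimal sketch of the dictionary-like access described above, the following uses h5py to list every dataset in a file together with its shape and dtype. The function name `summarize_h5` is an illustrative assumption, not a name defined by the pipeline, and the path you pass in must point at the HDF5 file from your own run.

```python
import h5py


def summarize_h5(path):
    """Return {dataset_name: (shape, dtype_string)} for every dataset in the file."""
    summary = {}

    def visit(name, obj):
        # visititems walks both groups and datasets; record datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary

# Usage (hypothetical path):
#   for name, (shape, dtype) in summarize_h5("path/to/output.h5").items():
#       print(name, shape, dtype)
```

Listing shapes first is a cheap way to check which keys are scalars, which are one array per file, and which hold one array per contig before loading any data.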
annotation: The gene annotation version included as part of the reference.
assembly: The assembly version of the sequence included as part of the reference.
library_id: A dataset of two columns, linking GEM well suffix to library ID.
organism: The reference organism.
pipeline: The name of the pipeline that generated the file, for example reanalyze.
pipeline_version: The version of the pipeline used to generate the file.
reference_path: The absolute path of the reference used in the pipeline run.
sample_desc: The sample description provided during the pipeline invocation.
sample_id: The sample ID provided during the pipeline invocation.
[-128, 126]. Copy number calling is performed across all mappable bins of the genome, and then imputed in unmappable regions based on neighboring bins. Negative values denote imputation: when imputation is successful, the value in the bin will be in [-126, -1], representing an imputed copy number in the range [1, 126]. If the neighboring bins during imputation have different copy numbers, then the value -128, or "no call", is used. If a copy number of 0 is imputed, then the value -127 will be assigned.
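The encoding above can be decoded with a small helper. This is a sketch that relies only on the rules stated in the text; the function name `decode_cnv` and its return convention (a copy number plus an "imputed" flag, with `None` standing for "no call") are illustrative assumptions.

```python
def decode_cnv(value):
    """Decode one stored value into (copy_number, imputed).

    copy_number is None for the "no call" value -128.
    """
    if not -128 <= value <= 126:
        raise ValueError(f"value {value} is outside [-128, 126]")
    if value >= 0:
        return value, False   # called directly in a mappable bin
    if value == -128:
        return None, True     # neighboring bins disagreed: no call
    if value == -127:
        return 0, True        # imputed copy number of 0
    return -value, True       # -126..-1 encode imputed copy numbers 126..1
```

For example, `decode_cnv(-3)` yields an imputed copy number of 3, while `decode_cnv(3)` yields a directly called copy number of 3.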
bin_size: The size of the bins used for CNV calling. Defaults to 20kb.
chroms: The names of the primary contigs in the reference used.
num_bins_per_chrom: The number of bin_size bins in each chromosome in
num_cells: The number of barcodes determined to be cells. Will match the size of
num_chroms: The number of primary contigs in the reference used. Will match the size of
num_nodes: The number of nodes in the hierarchical tree generated by the clustering. Will always be equal to (num_cells * 2) - 1 because the tree is binary.
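The num_nodes relationship follows because every merge in a binary tree creates exactly one internal node, so a tree with n leaves has n - 1 internal nodes. A short sketch of that arithmetic, using a hypothetical helper name:

```python
def tree_sizes(num_cells):
    """Return (num_internal_nodes, num_nodes) for a binary tree with num_cells leaves."""
    if num_cells < 1:
        raise ValueError("need at least one cell")
    num_internal = num_cells - 1          # each merge creates one internal node
    num_nodes = num_cells + num_internal  # equals (num_cells * 2) - 1
    return num_internal, num_nodes
```

This can serve as a sanity check when loading a file: the stored num_nodes should agree with the value computed from num_cells.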
gc_fraction: The fraction of GC content in each bin.
is_mappable: A boolean array of mappability calls for each bin. A bin is mappable if at least 90% of simulated reads generated from a given bin map back uniquely.
n_fraction: The fraction of unknown bases (Ns) in the given bin.
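Per-bin arrays such as is_mappable can be translated back to genomic coordinates once a bin layout is fixed. The sketch below assumes that bin i covers the half-open, 0-based interval [i * bin_size, (i + 1) * bin_size); that layout is an assumption for illustration and is not stated in the text above. The function name `mappable_intervals` is likewise hypothetical. Note that under this assumption the end of the last interval may overshoot the true chromosome length if the final bin is shorter than bin_size.

```python
def mappable_intervals(is_mappable, bin_size=20_000):
    """Return merged (start, end) intervals covered by runs of mappable bins.

    Assumes bin i spans [i * bin_size, (i + 1) * bin_size), 0-based half-open.
    """
    intervals = []
    run_start = None
    for i, ok in enumerate(is_mappable):
        if ok and run_start is None:
            run_start = i                       # a mappable run begins
        elif not ok and run_start is not None:
            intervals.append((run_start * bin_size, i * bin_size))
            run_start = None                    # the run ended before bin i
    if run_start is not None:                   # run extends through the last bin
        intervals.append((run_start * bin_size, len(is_mappable) * bin_size))
    return intervals
```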
Z: This is the linkage matrix output from SciPy clustering with complete linkage.
is_cell_in_group: An adjacency matrix for the cell-node graph. A (num_cells - 1) x num_cells bit matrix, where each X-axis value is an internal node and each Y-axis value is a leaf node (cell). This matrix has a value 1 in row x and column y when cell y is a member of internal node x and has a value 0 otherwise.
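To illustrate how the linkage matrix and the membership matrix relate, the sketch below rebuilds a membership matrix from a SciPy-style linkage matrix. It assumes the SciPy convention that row i merges the clusters in its first two columns into a new cluster with index num_cells + i, and that internal node x of the membership matrix corresponds to row x of Z; the second point is an assumption, not something the text above confirms. The function name is hypothetical.

```python
def membership_from_linkage(Z, num_cells):
    """Build a (num_cells - 1) x num_cells 0/1 matrix from a SciPy-style linkage matrix."""
    if len(Z) != num_cells - 1:
        raise ValueError("a linkage matrix for n cells has n - 1 rows")
    members = []  # members[x] = set of leaf indices under internal node x
    for left, right, _dist, _count in Z:
        leaves = set()
        for child in (int(left), int(right)):
            if child < num_cells:
                leaves.add(child)                     # child is a leaf (cell)
            else:
                leaves |= members[child - num_cells]  # child is an earlier merge
        members.append(leaves)
    return [[1 if y in leaves else 0 for y in range(num_cells)]
            for leaves in members]
```

With three cells and `Z = [[0, 1, 0.5, 2], [2, 3, 1.0, 3]]`, the first internal node contains cells 0 and 1, and the second (the root) contains all three cells.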
heterogeneity: This key has values for each primary contig in the reference, and has shape of (num_cells - 1) x num_bins_per_chromosome. For each internal node, the heterogeneity of cells within this cluster is calculated as 1 - (fraction majority), where fraction majority is the fraction of cells that agree with the most common copy number call.
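The heterogeneity formula can be sketched for a single bin as follows. The input is the list of copy number calls for the cells under one internal node; the function name is hypothetical, and because the text does not say how "no call" values are treated, this sketch simply counts every value as an ordinary call.

```python
from collections import Counter


def heterogeneity(copy_numbers):
    """Return 1 - (fraction majority) for the copy number calls of one bin."""
    if not copy_numbers:
        raise ValueError("need at least one cell")
    # most_common(1) gives the most frequent call and how many cells share it.
    _, majority_count = Counter(copy_numbers).most_common(1)[0]
    return 1 - majority_count / len(copy_numbers)
```

A cluster in which three of four cells share the same call has heterogeneity 1 - 3/4 = 0.25, and a cluster in full agreement has heterogeneity 0.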