
# HDF5 Output

cellranger-dna produces an HDF5 file that contains all of the key outputs of the pipeline. HDF5 files provide an easy way to store compressed data and can be read by tools such as h5py. The general structure is similar to a dictionary, with a series of keys storing values. Each of the sections below represents a key in the HDF5 file.
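As a rough illustration of the dictionary-like layout, the sketch below walks any nested mapping and lists its keys. h5py's `File` and `Group` objects offer the same `keys()` and indexing interface, so the function should also work on an opened file. The helper name `list_keys` and the stand-in dictionary are hypothetical and not part of cellranger-dna.

```python
def list_keys(node, prefix=""):
    """Return the slash-separated path of every key under a mapping-like node.

    Works on plain nested dicts; h5py.File and h5py.Group expose the same
    keys()/[] interface. Assumption of this sketch: anything without a
    keys() method is treated as a leaf dataset.
    """
    paths = []
    for key in node.keys():
        path = f"{prefix}/{key}"
        paths.append(path)
        child = node[key]
        if hasattr(child, "keys"):  # a group (or dict): descend into it
            paths.extend(list_keys(child, path))
    return paths


# Stand-in for an opened file, using two keys described in this document.
example = {"constants": {"bin_size": 20000, "num_cells": 3}, "cell_barcodes": []}
print(list_keys(example))
```

With h5py installed, passing an `h5py.File` opened in read mode to `list_keys` would list the keys of a real output file in the same way.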

A dictionary containing various parameters and pipestance context.
1. annotation: The gene annotation version included as part of the reference.
2. assembly: The assembly version of the sequence included as part of the reference.
3. library_id: A dataset of two columns, linking GEM well suffix to library ID.
4. organism: The reference organism.
5. pipeline: One of cnv, aggr, or reanalyze, denoting which pipeline generated the file.
6. pipeline_version: The version of the pipeline used to generate the file.
7. reference_path: The absolute path of the reference used in the pipeline run.
8. sample_desc: The sample description provided during the pipeline invocation.
9. sample_id: The sample ID provided during the pipeline invocation.

cell_barcodes
A list of all of the barcodes that were called as cells. An explanation of the cell calling algorithm can be found here.

cnvs
A dictionary mapping each primary contig in the reference to an integer copy number call matrix. Each row of the matrix represents a single cell or group of cells; a sample with N cells has 2N-1 rows. Each column is a 20 kb genomic bin on the primary contig. Rows 0 to N-1 correspond to the single-cell copy number calls, and rows N to 2N-2 represent the groups of cells defined by the hierarchical clustering as defined in the SciPy linkage matrix format. The values in this matrix lie in the interval [-128, 126]. Copy number calling is performed across all mappable bins of the genome, and then imputed in unmappable regions based on neighboring bins. Negative values denote imputation. When imputation succeeds, the value in the bin lies in [-126, -1], representing an imputed copy number in the range [1, 126]. If the neighboring bins have different copy numbers during imputation, the value -128 ("no call") is used. If a copy number of 0 is imputed, the value -127 is assigned.
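The encoding above can be unpacked with a few NumPy operations. The following is a minimal sketch, not the pipeline's own code: the function names, the `-1` sentinel for "no call", and the widening to 16-bit integers are choices made here for illustration. The value range suggests 8-bit storage, which the source does not state explicitly.

```python
import numpy as np

NO_CALL = -128        # imputation failed because neighboring bins disagree
IMPUTED_ZERO = -127   # imputed copy number of 0


def decode_cnv(matrix):
    """Split an encoded CNV matrix into copy numbers and two boolean masks.

    Returns (copy_number, imputed, no_call). copy_number is -1 where no
    call was made; that sentinel is a choice of this sketch, not the file's.
    """
    # Widen first: abs(-128) does not fit in an 8-bit signed integer.
    m = np.asarray(matrix).astype(np.int16)
    no_call = m == NO_CALL
    imputed = (m < 0) & ~no_call
    copy_number = np.abs(m)             # -k becomes k for values in [-126, -1]
    copy_number[m == IMPUTED_ZERO] = 0
    copy_number[no_call] = -1
    return copy_number, imputed, no_call


def split_rows(matrix, num_cells):
    """Separate single-cell rows (0..N-1) from cell-group rows (N..2N-2)."""
    matrix = np.asarray(matrix)
    return matrix[:num_cells], matrix[num_cells:]
```

Keeping the imputation flag as a separate mask lets downstream code decide whether imputed bins should count as evidence or be excluded.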

constants
A dictionary containing constants used in the pipeline.
1. bin_size: The size of the bins used for CNV calling. Defaults to 20 kb.
2. chroms: The names of the primary contigs in the reference used.
3. num_bins_per_chrom: The number of bin_size bins in each chromosome in chroms.
4. num_cells: The number of barcodes determined to be cells. Will match the size of cell_barcodes.
5. num_chroms: The number of primary contigs in the reference used. Will match the size of chroms.
6. num_nodes: The number of nodes in the hierarchical tree generated by the clustering. Will always be equal to (num_cells * 2) - 1 because the tree is binary.
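The relationships stated in this list can be verified after loading. The sketch below assumes the values have already been read into plain Python objects (a dict for `constants`, a list for `cell_barcodes`); the function name `check_constants` is hypothetical.

```python
def check_constants(constants, cell_barcodes):
    """Return a list of inconsistencies; an empty list means all checks pass."""
    problems = []
    if constants["num_cells"] != len(cell_barcodes):
        problems.append("num_cells does not match the size of cell_barcodes")
    if constants["num_chroms"] != len(constants["chroms"]):
        problems.append("num_chroms does not match the size of chroms")
    if len(constants["num_bins_per_chrom"]) != len(constants["chroms"]):
        problems.append("num_bins_per_chrom does not have one entry per contig")
    # A binary tree with N leaves has N - 1 internal nodes, so 2N - 1 in total.
    if constants["num_nodes"] != 2 * constants["num_cells"] - 1:
        problems.append("num_nodes is not (num_cells * 2) - 1")
    return problems
```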

genome_tracks
A dictionary containing values related to the reference used. These include GC content, mappability and unknown bases. Each of these objects contains keys for each primary contig in the reference used, and the values of those keys are the track values for that contig.
1. gc_fraction: The fraction of GC content in each bin.
2. is_mappable: A boolean array of mappability calls for each bin. A bin is mappable if at least 90% of simulated reads generated from a given bin map back uniquely.
3. n_fraction: The fraction of unknown bases (Ns) in the given bin.
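One possible use of these tracks is to select bins for downstream analysis. The sketch below combines the tracks of a single contig into a mask; the function names and the `max_n_fraction` threshold are illustrative assumptions, not values used by the pipeline.

```python
import numpy as np


def usable_bins(is_mappable, n_fraction, max_n_fraction=0.1):
    """Boolean mask of bins that are mappable and contain few unknown bases.

    max_n_fraction is a hypothetical cutoff chosen for this example.
    """
    mappable = np.asarray(is_mappable, dtype=bool)
    few_ns = np.asarray(n_fraction, dtype=float) <= max_n_fraction
    return mappable & few_ns


def mean_gc(gc_fraction, mask):
    """Mean GC fraction over the selected bins (nan if none are selected)."""
    selected = np.asarray(gc_fraction, dtype=float)[mask]
    return float(selected.mean()) if selected.size else float("nan")
```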

raw_counts, normalized_counts
These two dictionaries contain keys for each primary contig in the reference, and represent the raw and normalized counts for each leaf and node in the hierarchical clustering tree. The format of these matrices follows that of cnvs, representing the raw read counts for cells or clusters of cells, as well as the normalized read counts after GC bias correction.
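Because these matrices share the row layout of cnvs, the first num_cells rows of each contig matrix hold the single-cell values. The sketch below sums them across contigs to get a per-cell total; the function name is hypothetical, and the input may be any mapping from contig name to a 2-D array.

```python
import numpy as np


def per_cell_totals(counts_by_contig, num_cells):
    """Sum counts over all bins and contigs for each single cell.

    Only rows 0..num_cells-1 are used; the remaining rows describe groups
    of cells and would double-count reads if they were included.
    """
    totals = np.zeros(num_cells, dtype=np.float64)
    for contig in counts_by_contig.keys():
        matrix = np.asarray(counts_by_contig[contig])
        totals += matrix[:num_cells].sum(axis=1)
    return totals
```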

tree
The output of the hierarchical clustering is stored in the original format (Z) and in a more accessible parsed format (is_cell_in_group), as well as the heterogeneity for each internal node.
1. Z: This is the linkage matrix output from SciPy clustering with complete linkage.
2. is_cell_in_group: An adjacency matrix for the cell-node graph. A (num_cells - 1) x num_cells bit matrix, where each row is an internal node and each column is a leaf node (cell). This matrix has a value of 1 in row x and column y when cell y is a member of internal node x, and a value of 0 otherwise.
3. heterogeneity: This key has values for each primary contig in the reference, each with shape (num_cells - 1) x (number of bins on that contig). For each internal node, the heterogeneity of cells within this cluster is calculated as 1 - (fraction majority), where fraction majority is the fraction of cells that agree with the most common copy number call.
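The relationship between these three keys can be sketched as follows. In the SciPy linkage format, row i of Z merges the two clusters named in its first two columns into a new cluster with index num_cells + i, and indices below num_cells are leaves. The function names are hypothetical, and the heterogeneity function compares the supplied call values exactly as given, since the source does not say how the pipeline treats imputed or no-call bins.

```python
import numpy as np


def membership_from_linkage(Z, num_cells):
    """Rebuild a (num_cells - 1) x num_cells membership matrix from Z."""
    Z = np.asarray(Z)
    membership = np.zeros((num_cells - 1, num_cells), dtype=bool)
    for i in range(num_cells - 1):
        for child in (int(Z[i, 0]), int(Z[i, 1])):
            if child < num_cells:
                membership[i, child] = True        # child is a single cell
            else:
                # child is an earlier internal node: inherit its cells
                membership[i] |= membership[child - num_cells]
    return membership


def node_heterogeneity(cell_calls, membership):
    """Compute 1 - (fraction majority) for every internal node and bin."""
    cell_calls = np.asarray(cell_calls)
    membership = np.asarray(membership, dtype=bool)
    out = np.zeros((membership.shape[0], cell_calls.shape[1]))
    for x, members in enumerate(membership):
        calls = cell_calls[members]                # rows of the node's cells
        for b in range(calls.shape[1]):
            _, counts = np.unique(calls[:, b], return_counts=True)
            out[x, b] = 1.0 - counts.max() / calls.shape[0]
    return out
```

For example, a node whose three cells have calls 2, 2 and 3 in a bin has a fraction majority of 2/3 and a heterogeneity of 1/3 in that bin.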

per_cell_summary_metrics, summary_metrics
Datasets used internally to facilitate cellranger-dna aggr and cellranger-dna reanalyze pipelines. It is not recommended to depend on these datasets, as they may be deprecated at any time.
