10x Genomics
Chromium De Novo Assembly
Supernova2.1, printed on 12/11/2024
Assembly Statistics
Analysis software for 10x Genomics linked read products is no longer supported. Raw data processing pipelines and visualization tools are available for download and can be used for analyzing legacy data from 10x Genomics kits in accordance with our end user licensing agreement without support.
When a Supernova pipeline completes successfully, it logs a number of useful
statistics about the input data and the assembly in
outs/summary.csv, and in the similar but more complete
outs/assembly/stats/summary.json.
Many of the statistics contained there are defined below. Please also see
this document on molecule length statistics.
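The two summary files can be read with the Python standard library. The sketch below is a minimal example, not part of Supernova. It assumes that summary.csv holds one header row of metric names followed by one row of values, and that summary.json parses to a single JSON object; check both against your own output. The function names are hypothetical.

```python
import csv
import json


def load_summary_json(path):
    """Return the parsed contents of summary.json."""
    with open(path) as handle:
        return json.load(handle)


def load_summary_csv(path):
    """Return the first data row of summary.csv as a dict keyed by column name.

    Assumption: the file has one header row followed by one row of values.
    All values are returned as strings; convert them as needed.
    """
    with open(path, newline="") as handle:
        return next(csv.DictReader(handle))
```

For example, `load_summary_csv("outs/summary.csv")["scaffold_N50"]` would return the scaffold N50 as a string, provided the column is present under that name.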
In cases where metrics refer to kmers without specifying the size, the value of k is 48.
Some of the metrics refer to the base graph. This is the directed graph
created initially by Supernova, whose edges represent unbranched paths in a
de Bruijn graph of k=48 kmers. It is also known as a unipath graph.
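To make the idea of a unipath graph concrete, here is a toy sketch that collapses the unbranched paths of a small de Bruijn graph into single sequences. It is an illustration only, not Supernova's implementation: it ignores reverse complements and sequencing errors, trusts every observed kmer, and is meant for tiny inputs with a small k rather than k=48. The function name `unipaths` is hypothetical.

```python
from collections import defaultdict


def unipaths(reads, k):
    """Return the sequences of maximal unbranched paths in a toy de Bruijn graph.

    Nodes are (k-1)-mers and edges are the distinct k-mers seen in the reads.
    Simplifications of this sketch: no reverse complements, no error handling.
    """
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in sorted(kmers):
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def unbranched(node):
        # A path may pass through a node only if one edge enters and one leaves.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, ())) == 1

    def extend(kmer, used):
        # Follow the path from `kmer` while the next node is unbranched.
        seq = kmer
        used.add(kmer)
        node = kmer[1:]
        while unbranched(node) and out_edges[node][0] not in used:
            kmer = out_edges[node][0]
            used.add(kmer)
            seq += kmer[-1]
            node = kmer[1:]
        return seq

    used, paths = set(), []
    # Paths that start at a source or at a branching node.
    for node in sorted(out_edges):
        if not unbranched(node):
            for kmer in out_edges[node]:
                paths.append(extend(kmer, used))
    # Any k-mers still unused lie on isolated cycles.
    for kmer in sorted(kmers):
        if kmer not in used:
            paths.append(extend(kmer, used))
    return sorted(paths)
```

With `k=3`, the reads `"ACGTT"` and `"ACGAA"` share the prefix `ACG` and then diverge, so the graph branches at node `CG` and the sketch returns three edges: `ACG`, `CGAA` and `CGTT`.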
Information provided on the Supernova command line
Name | Description |
sample_id | Identifier of the sample. |
bcfrac | Fraction of barcodes in input reads to use. The bcfrac option is deprecated, so normally this will be 1. |
Metrics about the genome, computed from the data
Name | Description |
est_genome_size | Estimated genome size in bases, computed from the distribution of kmers. For tested control samples, genome size estimates appear to be accurate to within about 10%. It is theoretically possible, however, for an estimate to be far off, and we would like to see these cases. Single copy sex chromosomes are undercounted by a factor of two. This statistic could be confounded by microbiome sequences and contamination. |
repfrac | Genome repetitivity index: percent of read kmers, counted with multiplicity, whose depth exceeds twice the expected depth. Intended as an index of repetitivity, rather than a measure of ‘which fraction of the genome is repetitive’. This statistic could be confounded by microbiome sequences and contamination. |
hetdist | Mean distance in bases between heterozygous sites. May be overestimated in cases where alleles are so different that they assemble completely separately. |
gc_percent | Estimated GC content of the genome, computed from the assembly. |
high_AT_index | High AT index: predicted percent of kmers in the genome that are ≥ 90% AT. The estimate is biased downward because a true high AT kmer is less likely to be present in the data than an average true kmer. |
dinucleotide_percent | High dinucleotide fractions in the genome may correlate with long run times and assembly fragmentation. Here we estimate the fraction of genome bases that occur in a perfect dinucleotide repeat of length at least 20 bases (not counting homopolymer repeats). This is computed from a random sample of reads. The computed values tend to be in the range of 100-200% of the values that would be computed directly from a reference sequence for the same genome. Much larger scale differences (up to three orders of magnitude) are observable between genomes. |
ploidy_histogram | For each base graph edge of length 1000-2000 kmers, we estimate its ploidy, meaning the number of times that the sequence defined by the edge appears exactly in the genome, with homologous copies counted separately. We base the estimate on depth of read coverage, and normalize it to put a peak at 2.0, corresponding to the assumption that the genome is diploid. Thus, although the true ploidies are normally integers, the estimates are floating point numbers. We round them to one digit after the decimal point, then count the number of edges for each of the ploidy values 0.0, 0.1, …, 6.0. These counts are stored in a vector that we call the ploidy histogram. Note: Sometimes there will also be a visible peak at 1.0, typically arising from highly heterozygous regions or single copy sex chromosomes. This peak could be smaller or larger than the peak at 2.0. The presence of other peaks could be a sign either of peculiar input data or of defects in the normalization algorithm. Supernova uses the ploidy data to estimate the genome size and to mediate joining so as to prevent misassembly. |
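The rounding and counting step described for ploidy_histogram can be sketched as follows. This is a minimal illustration under stated assumptions: the per-edge ploidy estimates are taken as already computed and normalized, the function names are hypothetical, and estimates that round above 6.0 are simply dropped, which the description above does not specify.

```python
def ploidy_histogram(estimates, max_ploidy=6.0):
    """Count edges for each ploidy value 0.0, 0.1, ..., max_ploidy.

    `estimates` holds one normalized ploidy estimate per base graph edge.
    Assumption of this sketch: values outside the range are ignored.
    """
    n_bins = int(round(max_ploidy * 10)) + 1
    counts = [0] * n_bins
    for value in estimates:
        index = int(round(value * 10))  # round to one decimal place
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def histogram_peak(counts):
    """Return the ploidy value of the tallest bin."""
    return counts.index(max(counts)) / 10
```

For a diploid sample, `histogram_peak` would be expected to return 2.0, with a possible secondary peak near 1.0 as noted in the table.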
Non-barcode metrics about the data, computed from the data
Name | Description |
likely_sequencers | Illumina instrument model or models, inferred from flowcell id(s), with some uncertainty. |
nreads | Number of reads provided as input, after downsampling if requested. |
raw_coverage | Raw coverage. Total bases in all reads, before trimming off barcode sequences, divided by the estimated genome size. |
effective_coverage | Estimated effective coverage. This is the estimated deduplicated coverage of an average base on the genome, counting both alleles. The reported value is the mean for base graph edges of length 1000-2000 kmers that appear to have ploidy two (see ploidy_histogram). Changed in Supernova 2.0. |
effective_coverage_median | Estimated effective coverage, median definition. In Supernova 1.2, this was the effective coverage metric. It is included here for backward compatibility, and may be deleted in subsequent versions of Supernova. |
bases_per_read | Mean read length after removing the first 23 bases from the beginning of read one of each pair (the 16-base 10x barcode plus 7 additional bases). |
dup_perc | Percentage of read pairs that are called duplicates. Two read pairs are declared duplicates of each other if the placements of their first reads on the initial (K=48, de Bruijn) assembly graph are identical, and the first 5 bases of their second reads are the same. (On this basis, read pairs fall naturally into 'duplicate groups'.) Because the barcode is not considered in this comparison, read pairs having different barcodes may be called duplicates. Thus, duplication could be overestimated, especially in genomes with high repeat content. See also interdup_perc. |
interdup_perc | Of reads declared duplicates (see dup_perc), the percentage that occur in duplicate groups comprising more than one distinct barcode. |
median_ins_sz | Estimated median insert size of the library, as determined by read positions on the assembly graph. |
placed_frac | Fraction of reads placed uniquely on the final (phased) assembly. |
proper_pairs_perc | Of read pairs for which both reads are placed on the assembly, inferred percentage for which the reads have the correct orientation and separation. |
q30_r2_perc | Percentage of bases assigned quality ≥ 30 on read two. |
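The duplicate-group rule behind dup_perc and interdup_perc can be illustrated with a short sketch. It is not Supernova's code. The input representation (a barcode, an opaque placement value for read one, and the sequence of read two) is hypothetical, and the counting convention is an assumption: in each group of identical pairs, one pair is treated as the original and the rest as duplicates.

```python
from collections import defaultdict


def duplicate_stats(read_pairs):
    """Return (dup_perc, interdup_perc) for an iterable of read pairs.

    Each read pair is (barcode, read1_placement, read2_sequence).
    Pairs are grouped by read-one placement plus the first 5 bases of
    read two; the barcode is deliberately not part of the key.
    Assumption: all but one pair per group count as duplicates.
    """
    groups = defaultdict(list)
    total = 0
    for barcode, placement, read2 in read_pairs:
        groups[(placement, read2[:5])].append(barcode)
        total += 1

    dup = interdup = 0
    for barcodes in groups.values():
        extra = len(barcodes) - 1
        if extra > 0:
            dup += extra
            if len(set(barcodes)) > 1:  # group spans several barcodes
                interdup += extra

    dup_perc = 100.0 * dup / total if total else 0.0
    interdup_perc = 100.0 * interdup / dup if dup else 0.0
    return dup_perc, interdup_perc
```

A high interdup_perc under this sketch means that many apparent duplicates carry different barcodes, which is the situation in which duplication may be overestimated.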
Barcode metrics about the data, computed from the data
Please see this document for more details on the lw_mean_mol_len and bridge metrics.
Name | Description |
lw_mean_mol_len | Estimated length-weighted mean of molecule lengths, in bases, inferred from data. |
p10 | For an average point on the genome, the estimated number of molecules that extend 10 kb in both directions from that point, counting both alleles. |
rpb_N50 | N50 number of reads per 10x barcode. |
valid_bc_perc | Percent of reads assigned a valid 10x barcode. |
bridge | Mean number of barcodes whose reads contain two genomic 48-mers separated by one of several fixed distances. This is a vector. |
bridge_50 | Mean number of barcodes whose reads contain two genomic 48-mers separated by 50 kb. |
bridge_1_50 | The ratio bridge_1 / bridge_50. |
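As a reminder of what "length-weighted mean" means in lw_mean_mol_len, the sketch below computes it for an explicit list of lengths: each molecule is weighted by its own length, giving sum(L²) / sum(L). This shows the definition only. Supernova infers molecule lengths from the data, and that inference is not reproduced here; the function name is hypothetical.

```python
def length_weighted_mean(lengths):
    """Return sum(L * L) / sum(L) for a non-empty list of positive lengths."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one positive length")
    return sum(length * length for length in lengths) / total
```

For lengths of 10, 50 and 100 kb the ordinary mean is about 53.3 kb, while the length-weighted mean is 78.75 kb, because long molecules contribute more bases and therefore more weight.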
Output metrics
For purposes of computing these metrics, we form compressed scaffolds from the output of supernova mkoutput --style=pseudohap
by first replacing all stretches of Ns by single Ns, then discarding sequences of length < 10,000 bases.
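The two steps just described can be sketched in a few lines. This is an illustration under assumptions: scaffolds are given as plain strings of uppercase bases, and the function name is hypothetical. Lengths are measured after the runs of Ns have been collapsed, following the order given above.

```python
import re


def compressed_scaffolds(sequences, min_length=10_000):
    """Collapse each run of Ns to a single N, then drop short sequences."""
    compressed = (re.sub(r"N+", "N", seq) for seq in sequences)
    return [seq for seq in compressed if len(seq) >= min_length]
```

Passing `min_length=1_000` gives the variant used for scaffolds_1kb_plus.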
Name | Description |
assembly_size | The number of bases in the compressed scaffolds. |
edge_N50 | N50 size in bases of raw graph assembly edges. |
contig_N50 | After replacing Ns in compressed scaffolds by sequence breaks, this is the N50 size of the resulting sequences. |
phase_block_N50 | N50 size of phase blocks as defined in the pseudohap1 index file. |
scaffold_N50 | The N50 size of the compressed scaffolds. |
scaffolds_10kb_plus | The number of compressed scaffolds. |
scaffolds_1kb_plus | Same as scaffolds_10kb_plus, except that when computing compressed scaffolds, a threshold of 1,000 bases is used instead of 10,000 bases. |
m10 | The estimated percent of genomic kmers that are either missing from the assembly entirely or present only in scaffolds shorter than 10 kb. Each kmer counts once regardless of its multiplicity in the genome, and thus this measure discounts repeats. It measures assembly disorganization. How it is computed: the m10 statistic is the percent of base graph kmers in edges ≥ 100 bases that are missing from scaffolds > 10 kb in the final assembly. This does not include kmers that are completely missing from the data, although that fraction is expected to be very small for genomes having typical overall GC content. The statistic could include some noise. |
checksum | Assembly checksum. Used to confirm deterministic behavior. |
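Several of the metrics above are N50 values. The sketch below uses the usual definition of N50: the largest length L such that sequences of length at least L together contain at least half of all bases. It also shows how contig lengths can be obtained by breaking scaffolds at Ns. Supernova's exact handling of ties and edge cases is not documented here, so treat this as an approximation; the function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths, or 0 if it is empty."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def contig_lengths(scaffolds):
    """Split each scaffold at Ns and return the lengths of the pieces."""
    return [len(piece) for seq in scaffolds for piece in seq.split("N") if piece]
```

With a list of compressed scaffold strings, `n50([len(s) for s in scaffolds])` corresponds to scaffold_N50 and `n50(contig_lengths(scaffolds))` to contig_N50.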
Computational performance metrics
Name | Description |
mem_peak | Peak memory in GB: the maximum amount of memory used at any point by Supernova, as reported by the operating system. Because some Supernova stages base their memory usage on the total amount that is available, this statistic is not necessarily meaningful. |
etime_h | Wall clock time in hours for the Supernova run. |
read_rate_IO_1_threaded | Single-threaded read rate in MB/sec as reported by stage IO. The read rate we observe is typically at least 250 MB/sec. However, intermittent values about an order of magnitude lower are possible on well-behaved systems that happen to have high I/O load at a given time. Lower values are correlated with long run times. |
read_rate_IO_10_threaded | Ten-threaded read rate in MB/sec as reported by stage IO. The read rate we observe is typically at least 1500 MB/sec. See comments for read_rate_IO_1_threaded. Lower values are correlated with long run times. |
read_rate_DF_1_threaded | Single-threaded read rate in MB/sec as reported by stage DF. Like read_rate_IO_1_threaded, but checked at a different point during execution. See comments for read_rate_IO_1_threaded. |
Auxiliary statistics
In addition to the metrics contained in the outs/summary.csv file,
the outs/assembly/stats/ folder contains more fine-grained information about the input data and the assembly as discussed below.
File | Contents |
histogram_reads_per_barcode.json | Histogram of the number of reads that share a common 10x barcode (bin size = 10). |
mol_length_dist.pdf | Plot of the predicted abundance of molecule lengths. See Molecule Length Calculation. |
histogram_edge.json | Histogram of assembly graph edge lengths (in 1 kb bins). |
histogram_contig.json | Histogram of contig lengths (in 1 kb bins). |
histogram_phase_block.json | Histogram of phase block lengths in the assembly (in 1 kb bins). |
histogram_scaffold.json | Histogram of scaffold lengths (in 10 kb bins). |