# Assembly Statistics

When a Supernova pipeline completes successfully, a number of useful statistics about the input data and the assembly are logged in outs/summary.csv and in the similar but more complete outs/assembly/stats/summary.json. Many of these statistics are defined below. Please also see this document on molecule length statistics.
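
As a minimal sketch of how these files might be consumed, the snippet below loads summary.json and picks out a few of the metrics defined on this page. It assumes, without confirmation from this page, that the file is a flat JSON object keyed by metric name. The helper names load_summary and pick_metrics are hypothetical and are not part of Supernova.

```python
import json


def load_summary(path):
    """Load a summary.json file (assumed to be a flat JSON object)."""
    with open(path) as handle:
        return json.load(handle)


def pick_metrics(summary, names):
    """Return the requested metrics; metrics that are absent map to None."""
    return {name: summary.get(name) for name in names}


if __name__ == "__main__":
    # Path taken from the text above, relative to the pipeline output directory.
    summary = load_summary("outs/assembly/stats/summary.json")
    wanted = ["est_genome_size", "effective_coverage", "scaffold_N50"]
    for name, value in pick_metrics(summary, wanted).items():
        print(name, value)
```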

In cases where metrics refer to kmers without specifying the size, the value of k is 48. Some of the metrics refer to the base graph. This is the directed graph created initially by Supernova, whose edges represent unbranched paths in a de Bruijn graph of k=48 kmers. It is also known as a unipath graph.
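
To make the idea of a unipath graph concrete, here is a toy sketch that collapses a set of kmers into maximal unbranched paths. It is an illustration under simplifying assumptions, not Supernova's implementation: it ignores reverse complements, uses a tiny k instead of 48, and models each kmer as an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function names kmers_of and unipaths are hypothetical.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unipaths(kmers):
    """Collapse a kmer set into maximal unbranched paths (unipaths)."""
    kmers = sorted(set(kmers))
    out_edges = defaultdict(list)  # (k-1)-mer -> kmers that leave it
    in_degree = defaultdict(int)   # (k-1)-mer -> number of kmers that enter it
    for kmer in kmers:
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def is_through(node):
        # A node with exactly one way in and one way out cannot end a unipath.
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths, used = [], set()
    # Pass 1: start a path at every kmer that leaves a branching or terminal node.
    for kmer in kmers:
        if is_through(kmer[:-1]):
            continue
        seq, cur = kmer, kmer
        used.add(kmer)
        while is_through(cur[1:]):
            cur = out_edges[cur[1:]][0]
            used.add(cur)
            seq += cur[-1]
        paths.append(seq)
    # Pass 2: any kmer still unused lies on an isolated, unbranched cycle.
    for kmer in kmers:
        if kmer in used:
            continue
        seq, cur = kmer, kmer
        used.add(kmer)
        while True:
            cur = out_edges[cur[1:]][0]
            if cur in used:
                break
            used.add(cur)
            seq += cur[-1]
        paths.append(seq)
    return paths


# Two sequences that share the prefix ACGT and then diverge.
print(unipaths(kmers_of("ACGTA", 3) | kmers_of("ACGTC", 3)))
# → ['ACGT', 'GTA', 'GTC']
```

The shared prefix becomes one edge, and each diverging branch becomes its own edge, which is the sense in which base graph edges represent unbranched paths.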

## Information provided on the Supernova command line

| Name | Description |
|---|---|
| sample_id | Identifier of the sample. |
| bcfrac | Fraction of barcodes in input reads to use. The bcfrac option is deprecated, so normally this will be 1. |

## Metrics about the genome, computed from the data

| Name | Description |
|---|---|
| est_genome_size | Estimated genome size in bases, computed from the distribution of kmers. For tested control samples, genome size estimates appear to be accurate to within about 10%. However, it is theoretically possible for an estimate to be far off, and we would like to see these cases. Single-copy sex chromosomes are undercounted by a factor of two. This statistic could be confounded by microbiome sequences and contamination. |
| repfrac | Genome repetitivity index: percent of read kmers, counted with multiplicity, whose depth exceeds twice the expected depth. Intended as an index of repetitivity, rather than a measure of 'which fraction of the genome is repetitive'. This statistic could be confounded by microbiome sequences and contamination. |
| hetdist | Mean distance in bases between heterozygous sites. May be overestimated in cases where alleles are so different that they assemble completely separately. |
| gc_percent | Estimated GC content of the genome, computed from the assembly. |
| high_AT_index | High AT index: predicted percent of kmers in the genome that are ≥ 90% AT. This value is biased downward, because the probability that a true high-AT kmer is present in the data is less than that of an average true kmer. |
| dinucleotide_percent | High dinucleotide fractions in the genome may correlate with long run times and assembly fragmentation. Here we estimate the fraction of genome bases that occur in a perfect dinucleotide repeat of length at least 20 bases (not counting homopolymer repeats). This is computed from a random sample of reads. The computed values tend to be in the range of 100-200% of the values that would be computed directly from a reference sequence for the same genome. Much larger differences (up to three orders of magnitude) are observable between genomes. |
| ploidy_histogram | For each base graph edge of length 1000-2000 kmers, we estimate its ploidy, meaning the number of times that the sequence defined by the edge appears exactly in the genome, with homologous copies counted separately. We base the estimate on depth of read coverage, and normalize it to put a peak at 2.0, corresponding to the assumption that the genome is diploid. Thus, although the true ploidies are normally integers, the estimates are floating point numbers. We round them to one digit after the decimal point, then count the number of edges for each of the ploidy values 0.0, 0.1, …, 6.0. These counts are stored in a vector that we call the ploidy histogram. Note: Sometimes there will also be a visible peak at 1.0, typically arising from highly heterozygous regions or single-copy sex chromosomes. This peak could be smaller or larger than the peak at 2.0. The presence of other peaks could be a sign either of peculiar input data or of defects in the normalization algorithm. Supernova uses the ploidy data to estimate the genome size and to mediate joining so as to prevent misassembly. |
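
The binning step described for ploidy_histogram can be sketched as follows. This is only an illustration of the rounding and counting: it takes already normalized per-edge ploidy estimates as input and does not attempt the coverage-based estimation or the normalization. Dropping estimates above 6.0 is an assumption made here, since this page does not say how such values are handled, and the function name is hypothetical.

```python
def ploidy_histogram(estimates, max_ploidy=6.0):
    """Count edges per ploidy value 0.0, 0.1, ..., max_ploidy.

    Entry i of the result is the number of estimates that round to i / 10.
    """
    n_bins = int(round(max_ploidy * 10)) + 1  # 61 bins for 0.0 .. 6.0
    hist = [0] * n_bins
    for ploidy in estimates:
        index = int(round(ploidy * 10))  # round to one digit after the decimal point
        if 0 <= index < n_bins:          # assumption: out-of-range values are dropped
            hist[index] += 1
    return hist


# Hypothetical estimates: three edges near 2.0, one near 1.0, one out of range.
hist = ploidy_histogram([1.96, 2.04, 2.0, 1.02, 7.3])
print(hist[20], hist[10], len(hist))
# → 3 1 61
```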

## Non-barcode metrics about the data, computed from the data

| Name | Description |
|---|---|
| likely_sequencers | Illumina instrument model or models, inferred from flowcell id(s), with some uncertainty. |
| raw_coverage | Raw coverage: total bases in all reads, before trimming off barcode sequences, divided by the estimated genome size. |
| effective_coverage | Estimated effective coverage. This is the estimated deduplicated coverage of an average base on the genome, counting both alleles. The reported value is the mean for base graph edges of length 1000-2000 kmers that appear to have ploidy two (see ploidy_histogram). Changed in Supernova 2.0. |
| effective_coverage_median | Estimated effective coverage, median definition. In Supernova 1.2, this was the effective coverage metric. It is included here for backward compatibility, and may be deleted in subsequent versions of Supernova. |
| bases_per_read | Mean read length after removing the first 23 bases from the beginning of read one of each pair (the 16-base 10x barcode plus 7 additional bases). |
| dup_perc | Percentage of read pairs that are called duplicates. Two read pairs are declared duplicates of each other if the placements of their first reads on the initial (K=48, de Bruijn) assembly graphs are identical, and the first 5 bases of their second reads are the same. (On this basis, read pairs fall naturally into 'duplicate groups'.) Because the barcode is not considered in this comparison, read pairs having different barcodes may be called duplicates. Thus, duplication could be overestimated, especially in genomes with high repeat content. See also interdup_perc. |
| interdup_perc | Of reads declared duplicates (see dup_perc), the percentage that occur in duplicate groups comprising more than one distinct barcode. |
| median_ins_sz | Estimated median insert size of the library, as determined by read positions on the assembly graph. |
| placed_frac | Fraction of reads placed uniquely on the final (phased) assembly. |
| proper_pairs_perc | Of read pairs for which both reads are placed on the assembly, inferred percentage for which the reads have the correct orientation and separation. |
| q30_r2_perc | Percentage of bases assigned quality ≥ 30 on read two. |
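
The grouping rule behind dup_perc and interdup_perc can be sketched as below. Several details are assumptions made for illustration and are not stated on this page: each read pair is represented as a (placement, read-two sequence, barcode) tuple with an opaque placement key, and within each duplicate group all pairs except one are counted as duplicates. The function name is hypothetical.

```python
from collections import defaultdict


def duplicate_stats(read_pairs):
    """Return (dup_perc, interdup_perc) for (placement, read2_seq, barcode) tuples."""
    groups = defaultdict(list)  # (placement, first 5 bases of read two) -> barcodes
    total = 0
    for placement, read2_seq, barcode in read_pairs:
        # The barcode is deliberately left out of the grouping key.
        groups[(placement, read2_seq[:5])].append(barcode)
        total += 1

    duplicates = 0     # assumption: every pair beyond the first in a group
    multi_barcode = 0  # duplicates that sit in groups with more than one barcode
    for barcodes in groups.values():
        extra = len(barcodes) - 1
        duplicates += extra
        if len(set(barcodes)) > 1:
            multi_barcode += extra

    dup_perc = 100.0 * duplicates / total if total else 0.0
    interdup_perc = 100.0 * multi_barcode / duplicates if duplicates else 0.0
    return dup_perc, interdup_perc


# Hypothetical input: one group spanning two barcodes, one single-barcode group,
# and one read pair with no duplicate.
pairs = [
    ("edge1:10", "ACGTAGG", "bcA"),
    ("edge1:10", "ACGTATT", "bcB"),
    ("edge1:10", "ACGTACC", "bcA"),
    ("edge2:5", "TTTTTAA", "bcC"),
    ("edge2:5", "TTTTTCC", "bcC"),
    ("edge3:7", "GGGGGAA", "bcD"),
]
print(duplicate_stats(pairs))
```

Because the barcode is not part of the key, the first group is counted as duplication even though two different barcodes contributed to it, which is the overestimation that interdup_perc is meant to expose.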

## Barcode metrics about the data, computed from the data

Please see this document for more details on the lw_mean_mol_len and bridge metrics.

| Name | Description |
|---|---|
| lw_mean_mol_len | Estimated length-weighted mean of molecule lengths, in bases, inferred from data. |
| p10 | For an average point on the genome, the estimated number of molecules that extend 10 kb in both directions from that point, counting both alleles. |
| rpb_N50 | N50 number of reads per 10x barcode. |
| valid_bc_perc | Percent of reads assigned a valid 10x barcode. |
| bridge | Mean number of barcodes whose reads contain two genomic 48-mers separated by one of several fixed distances. This is a vector. |
| bridge_50 | Mean number of barcodes whose reads contain two genomic 48-mers separated by 50 kb. |
| bridge_1_50 | The ratio bridge_1 / bridge_50. |
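
For readers unfamiliar with length weighting, the sketch below shows how a length-weighted mean differs from a plain mean. It only illustrates the weighting in lw_mean_mol_len: Supernova infers molecule lengths from the data, whereas here the lengths are simply given as a hypothetical list, and the function name is an assumption.

```python
def length_weighted_mean(lengths):
    """Mean in which each molecule is weighted by its own length.

    Equivalent to sum(L * L) / sum(L): the expected length of the molecule
    covering a randomly chosen base.
    """
    lengths = list(lengths)
    total = sum(lengths)
    if total == 0:
        raise ValueError("need at least one molecule of nonzero length")
    return sum(length * length for length in lengths) / total


# Hypothetical molecule lengths in bases.
molecules = [10_000, 30_000]
print(sum(molecules) / len(molecules))  # plain mean
# → 20000.0
print(length_weighted_mean(molecules))  # length-weighted mean
# → 25000.0
```

Long molecules count for more in the weighted version, so it is always at least as large as the plain mean.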

## Output metrics

For purposes of computing these metrics, we form compressed scaffolds from the output of supernova mkoutput --style=pseudohap by first replacing all stretches of Ns by single Ns, then discarding sequences of length < 10,000 bases.

| Name | Description |
|---|---|
| assembly_size | The number of bases in the compressed scaffolds. |
| edge_N50 | N50 size in bases of raw graph assembly edges. |
| contig_N50 | After replacing Ns in compressed scaffolds by sequence breaks, this is the N50 size of the resulting sequences. |
| phase_block_N50 | N50 size of phase blocks as defined in the pseudohap1 index file. |
| scaffold_N50 | The N50 size of the compressed scaffolds. |
| scaffolds_10kb_plus | The number of compressed scaffolds. |
| scaffolds_1kb_plus | Same as scaffolds_10kb_plus, except that when computing compressed scaffolds, a threshold of 1,000 bases is used instead of 10,000 bases. |
| m10 | The estimated percent of genomic kmers that are either missing from the assembly entirely or present only in scaffolds shorter than 10 kb. Each kmer counts once regardless of its multiplicity in the genome, so this measure discounts repeats. It measures assembly disorganization. How it is computed: the m10 statistic is the percent of base graph kmers in edges ≥ 100 bases that are missing from scaffolds > 10 kb in the final assembly. This does not include kmers that are completely missing from the data, although that fraction is expected to be very small for genomes having typical overall GC content. The statistic could include some noise. |
| checksum | Assembly checksum. Used to confirm deterministic behavior. |
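
The compressed-scaffold construction and the N50-based metrics above can be sketched as follows. This is a rough illustration, assuming the scaffold sequences have already been read into Python strings. It uses the common definition of N50 (the largest length L such that sequences of length at least L contain at least half of all bases), which this page does not spell out, and it counts the single retained Ns toward assembly_size, which is also an assumption. All function names are hypothetical.

```python
import re


def compress_scaffolds(seqs, min_len=10_000):
    """Replace each run of Ns by a single N, then drop sequences shorter than min_len."""
    compressed = (re.sub(r"[Nn]+", "N", seq) for seq in seqs)
    return [seq for seq in compressed if len(seq) >= min_len]


def n50(lengths):
    """Largest L such that items of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def output_metrics(seqs, min_len=10_000):
    """Compute a few of the output metrics from raw scaffold sequences."""
    scaffolds = compress_scaffolds(seqs, min_len)
    # Contigs: break the compressed scaffolds at every remaining N.
    contigs = [part for seq in scaffolds for part in seq.split("N") if part]
    return {
        "assembly_size": sum(len(seq) for seq in scaffolds),
        "scaffold_N50": n50(len(seq) for seq in scaffolds),
        "contig_N50": n50(len(part) for part in contigs),
        # Corresponds to scaffolds_10kb_plus when min_len is 10,000
        # and to scaffolds_1kb_plus when min_len is 1,000.
        "scaffold_count": len(scaffolds),
    }


print(n50([2, 3, 4, 5]))
# → 4
```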

## Computational performance metrics

| Name | Description |
|---|---|
| mem_peak | Peak memory in GB: the maximum amount of memory used at any point by Supernova, as reported by the operating system. Because some Supernova stages base their memory usage on the total amount that is available, this statistic is not necessarily meaningful. |
| etime_h | Wall clock time in hours for the Supernova run. |
| read_rate_IO_1_threaded | Single-threaded read rate in MB/sec as reported by stage IO. The read rate we observe is typically at least 250 MB/sec. However, intermittent values lower by about an order of magnitude are possible on well-behaved systems that happen to have high I/O load at a given time. Lower values are correlated with long run times. |