
10x Genomics
Chromium De Novo Assembly

Assembly Statistics

On successful completion of a Supernova pipeline, a number of useful statistics about the input data and the assembly are logged in outs/summary.csv. The statistics in that file are defined below.

Input statistics

abbreviation | name | definition
sample_id | sample identifier | identifier of the sample
nreads | number of reads | number of reads provided as input, after downsampling if requested
bases_per_read | mean read length | mean read length after removing the first 23 bases from the beginning of read one of each pair (the 16-base 10x barcode plus 7 additional bases)
dup_perc | read duplication percent | percentage of read pairs that are duplicated, as determined by identical start and stop positions on the assembly graph
hetdist | distance between het sites | mean distance between heterozygous sites
lw_mean_mol_len | LWM molecule length | estimated length-weighted mean of molecule lengths
median_ins_sz | median insert size | estimated size of median inserts in library, as determined by read positions on the assembly graph
placed_frac | fraction of reads placed | fraction of reads uniquely placed on final (phased) assembly
proper_pairs_perc | proper pairs percent | of read pairs for which both reads are placed on the assembly, inferred fraction for which the reads have the correct orientation and separation
q30_r2_perc | read two q30 percent | fraction of bases assigned quality score ≥ 30 on read two
rpb_N50 | N50 reads per barcode | N50 number of reads per 10x barcode
valid_bc_perc | valid barcode percent | percent of reads assigned a valid 10x barcode
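
The lw_mean_mol_len entry reports a length-weighted mean, which differs from a plain average. The sketch below illustrates only the weighting, assuming the usual definition (sum of squared lengths divided by the sum of lengths). It is not how Supernova estimates molecule lengths from the reads, and the function name and sample lengths are hypothetical.

```python
def length_weighted_mean(lengths):
    """Return sum(L^2) / sum(L), so each molecule is weighted by its own length."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("need at least one molecule with positive length")
    return sum(length * length for length in lengths) / total


# Hypothetical molecule lengths in bases (illustrative values, not from a real run).
molecule_lengths = [10_000, 20_000, 70_000]

plain_mean = sum(molecule_lengths) / len(molecule_lengths)
lw_mean = length_weighted_mean(molecule_lengths)

# The length-weighted mean is pulled toward the long molecules,
# so it is larger than the plain mean for this mixed-length sample.
print(f"plain mean: {plain_mean:.0f} bases, length-weighted mean: {lw_mean:.0f} bases")
```

Because long molecules carry most of the DNA mass, the length-weighted value describes the molecule a typical base comes from rather than the typical molecule.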

Output statistics

Note that assembly size and N50 values are computed after removing scaffolds ≤ 10 kb and do not count N's.

abbreviation | name | definition
assembly_size | assembly size | size of assembly in bases, counting only one allele
edge_N50 | N50 edge size | N50 size of raw graph assembly edges in bases
contig_N50 | N50 contig size | N50 size of contigs in bases
phase_block_N50 | N50 phase block size | N50 size of phase blocks in bases
scaffold_N50 | N50 scaffold size | N50 size of scaffolds in bases
scaffolds_1kb_plus | number of scaffolds ≥ 1 kb | number of scaffolds that are at least 1 kb long
scaffolds_10kb_plus | number of scaffolds ≥ 10 kb | number of scaffolds that are at least 10 kb long
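
As a rough illustration of the N50 values above, the following sketch computes a standard N50 (the largest length L such that sequences of length at least L hold at least half of the total bases) after dropping scaffolds of 10 kb or less. It assumes the supplied lengths already exclude N's. Whether Supernova applies the 10 kb cutoff before or after discounting N's is not stated here, and the function names and sample lengths are hypothetical.

```python
def n50(lengths):
    """Return the largest L such that items of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("cannot compute N50 of an empty collection")


def scaffold_n50(scaffold_lengths, min_len=10_000):
    """N50 over scaffolds strictly longer than min_len (scaffolds <= 10 kb are removed)."""
    kept = [length for length in scaffold_lengths if length > min_len]
    return n50(kept)


# Hypothetical scaffold lengths in bases, with N's already excluded.
lengths = [40_000, 25_000, 12_000, 10_000, 9_000, 9_000, 9_000, 9_000]

unfiltered = n50(lengths)        # short scaffolds still count toward the total
filtered = scaffold_n50(lengths)  # short scaffolds removed first
print(unfiltered, filtered)
```

Removing short scaffolds shrinks the total, so the filtered N50 can only stay the same or increase relative to the unfiltered one.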

Auxiliary statistics

In addition to the metrics in the outs/summary.csv file, the outs/assembly/stats/ folder contains finer-grained information about the input data and the assembly, described below.

File | Content
histogram_reads_per_barcode.json | histogram of the number of reads that share a common 10x barcode (bin size = 10)
histogram_kmer_count.json | histogram of the frequency of kmers (K=48) amongst the reads, after removing potentially erroneous kmers based on quality scores, low multiplicity, or occurrence in only one barcode (histogram uses bin size = 1)
kmer_spectrum.pdf | plot of the histogram in histogram_kmer_count.json, truncated to kmer frequencies in the range 0 - 100.
histogram_molecules.json | histogram of the inferred length of input DNA that was used to generate Linked-Reads (in 1 kb bins, with minimum molecule length threshold of 1 kb)
molecule_lengths.pdf | plot of the percentage of input DNA mass in 1 kb molecule length bins in the window of 1 - 300 kb. This plot is a length-weighted version of the histogram in the above file that has been smoothed using the LOWESS algorithm.
histogram_edge.json | histogram of assembly graph edge lengths (in 1 kb bins)
histogram_contig.json | histogram of contig lengths (in 1 kb bins)
histogram_phase_block.json | histogram of phase block lengths in the assembly (in 1 kb bins)
histogram_scaffold.json | histogram of scaffold lengths (in 10 kb bins)
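
The relationship between a molecule-count histogram and the length-weighted view plotted in molecule_lengths.pdf can be sketched as follows. The layout of the JSON files is not described here, so the example works on a plain dictionary mapping each bin's start (in kb) to a molecule count. That layout, the function name, and the counts are all assumptions made for illustration, and the LOWESS smoothing step is omitted.

```python
def mass_percent_by_bin(counts_by_bin_kb, bin_width_kb=1):
    """Convert molecule counts per length bin into percent of total DNA mass per bin.

    counts_by_bin_kb maps the start of each bin (in kb) to the number of
    molecules in that bin (an assumed layout, not the actual JSON schema).
    Each bin's mass is approximated as count * bin midpoint.
    """
    mass = {
        start: count * (start + bin_width_kb / 2)
        for start, count in counts_by_bin_kb.items()
    }
    total = sum(mass.values())
    if total == 0:
        raise ValueError("histogram contains no molecules")
    return {start: 100 * m / total for start, m in sorted(mass.items())}


# Hypothetical counts: many short molecules, few long ones.
counts = {10: 4, 50: 2, 100: 1}
percent = mass_percent_by_bin(counts)

for start_kb, pct in percent.items():
    print(f"{start_kb}-{start_kb + 1} kb: {pct:.1f}% of DNA mass")
```

In this made-up example the 10 kb bin holds the most molecules, yet most of the DNA mass sits in the longer bins, which is why the length-weighted plot can look quite different from the raw histogram.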