Supernova1.2, printed on 11/21/2024
When a Supernova pipeline completes successfully, a number of statistics about the input data and the assembly are logged in outs/summary.csv. The statistics contained there are defined below.
abbreviation | name | definition |
---|---|---|
sample_id | sample identifier | identifier of the sample |
nreads | number of reads | number of reads provided as input, after downsampling if requested |
bases_per_read | mean read length | mean read length after removing the first 23 bases from the beginning of read one of each pair (the 16-base 10x barcode plus 7 additional bases) |
dup_perc | read duplication percent | percentage of read pairs that are duplicated, as determined by identical start and stop positions on the assembly graph |
hetdist | distance between het sites | mean distance between heterozygous sites |
lw_mean_mol_len | LWM molecule length | estimated length-weighted mean of molecule lengths |
median_ins_sz | median insert size | estimated size of median inserts in library, as determined by read positions on the assembly graph |
placed_frac | fraction of reads placed | fraction of reads uniquely placed on final (phased) assembly |
proper_pairs_perc | proper pairs percent | of read pairs for which both reads are placed on the assembly, inferred fraction for which the reads have the correct orientation and separation |
q30_r2_perc | read two q30 percent | fraction of bases assigned quality score ≥ 30 on read two |
rpb_N50 | N50 reads per barcode | N50 number of reads per 10x barcode |
valid_bc_perc | valid barcode percent | percent of reads assigned a valid 10x barcode |
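
To illustrate what a length-weighted mean (as in lw_mean_mol_len) measures, here is a minimal sketch in Python using the standard definition, sum of squared lengths divided by sum of lengths. It is not Supernova's estimator, which is not described here; the function name and the example lengths are hypothetical.

```python
def length_weighted_mean(lengths):
    """Return sum(L^2) / sum(L): each molecule is weighted by its own length."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("need at least one molecule of positive length")
    return sum(length * length for length in lengths) / total


# Hypothetical molecule lengths in bases (illustrative values only).
example_lengths = [10_000, 20_000, 70_000]

lw_mean = length_weighted_mean(example_lengths)
plain_mean = sum(example_lengths) / len(example_lengths)

print(f"length-weighted mean: {lw_mean:.0f} bases")
print(f"unweighted mean:      {plain_mean:.0f} bases")
```

Because long molecules contribute more bases, the length-weighted mean is never smaller than the unweighted mean; it answers the question "how long is the molecule that a randomly chosen base came from, on average?"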
Note that assembly size and N50 values are computed after removing scaffolds ≤ 10 kb and do not count N's.
abbreviation | name | definition |
---|---|---|
assembly_size | assembly size | size of assembly in bases, counting only one allele |
edge_N50 | N50 edge size | N50 size of raw graph assembly edges in bases |
contig_N50 | N50 contig size | N50 size of contigs in bases |
phase_block_N50 | N50 phase block size | N50 size of phase blocks in bases |
scaffold_N50 | N50 scaffold size | N50 size of scaffolds in bases |
scaffolds_1kb_plus | number of scaffolds ≥ 1 kb | number of scaffolds that are at least 1 kb long |
scaffolds_10kb_plus | number of scaffolds ≥ 10 kb | number of scaffolds that are at least 10 kb long |
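
The N50 values above follow the usual definition: the length L such that items of length at least L account for at least half of the total. The Python sketch below shows that definition together with the 10 kb scaffold filter mentioned in the note preceding the table. It is an illustration, not Supernova's code: the function names and example lengths are hypothetical, and the lengths are assumed to already exclude N's.

```python
MIN_SCAFFOLD = 10_000  # scaffolds <= 10 kb are removed before computing N50


def n50(lengths):
    """Largest length L such that items of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def scaffold_n50(scaffold_lengths):
    """N50 after dropping scaffolds of 10 kb or less (lengths assumed to exclude N's)."""
    kept = [n for n in scaffold_lengths if n > MIN_SCAFFOLD]
    return n50(kept)


# Hypothetical scaffold lengths in bases: four long scaffolds and ten short ones.
example = [60_000, 40_000, 30_000, 20_000] + [8_000] * 10

print("N50 without filter:", n50(example))
print("N50 with 10 kb filter:", scaffold_n50(example))
```

In this example the many short scaffolds pull the unfiltered N50 down, which is why the filter threshold matters when comparing N50 values between tools.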
In addition to the metrics contained in the outs/summary.csv file, the outs/assembly/stats/ folder contains finer-grained information about the input data and the assembly, as described below.
File | Content |
---|---|
histogram_reads_per_barcode.json | histogram of the number of reads that share a common 10x barcode (bin size = 10) |
histogram_kmer_count.json | histogram of the frequency of kmers (K=48) amongst the reads, after removing potentially erroneous kmers based on quality scores, low multiplicity, or occurrence in only one barcode (histogram uses bin size = 1) |
kmer_spectrum.pdf | plot of the histogram in histogram_kmer_count.json, truncated to kmer frequencies in the range 0 - 100. |
histogram_molecules.json | histogram of the inferred length of input DNA that was used to generate Linked-Reads (in 1 kb bins, with minimum molecule length threshold of 1 kb) |
molecule_lengths.pdf | plot of the percentage of input DNA mass in 1 kb molecule length bins in the window of 1 - 300 kb. This plot is a length-weighted version of the histogram in the above file that has been smoothed using the LOWESS algorithm. |
histogram_edge.json | histogram of assembly graph edge lengths (in 1 kb bins) |
histogram_contig.json | histogram of contig lengths (in 1 kb bins) |
histogram_phase_block.json | histogram of phase block lengths in the assembly (in 1 kb bins) |
histogram_scaffold.json | histogram of scaffold lengths (in 10 kb bins) |
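
The difference between the count histogram in histogram_molecules.json and the length-weighted view plotted in molecule_lengths.pdf can be sketched in Python as follows. This is an illustration under stated assumptions: it works on a hypothetical list of molecule lengths rather than on the JSON files (whose schema is not described here), it treats the 1 kb minimum as inclusive, and it omits the LOWESS smoothing step.

```python
from collections import Counter

BIN_SIZE = 1_000    # 1 kb bins
MIN_LENGTH = 1_000  # minimum molecule length threshold (assumed inclusive)


def count_histogram(lengths):
    """Number of molecules per 1 kb bin, keyed by the bin's lower edge in bases."""
    counts = Counter()
    for length in lengths:
        if length >= MIN_LENGTH:
            counts[(length // BIN_SIZE) * BIN_SIZE] += 1
    return dict(sorted(counts.items()))


def mass_percent_histogram(lengths):
    """Percentage of total DNA mass (bases) per 1 kb bin, keyed by lower bin edge."""
    mass = Counter()
    for length in lengths:
        if length >= MIN_LENGTH:
            mass[(length // BIN_SIZE) * BIN_SIZE] += length
    total = sum(mass.values())
    return {edge: 100.0 * m / total for edge, m in sorted(mass.items())}


# Hypothetical molecule lengths in bases; the 500-base molecule is below threshold.
molecules = [500, 1_500, 1_800, 48_200]

counts = count_histogram(molecules)
mass_pct = mass_percent_histogram(molecules)
print(counts)
print(mass_pct)
```

Two short molecules dominate the count histogram here, while the single long molecule carries most of the DNA mass, which is the effect the length-weighted plot is meant to show.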