Chromium De Novo Assembly
Supernova, printed on 11/29/2020
- Fix an issue where the pipeline would crash (usually in the ASSEMBLER_IO stage) on CPUs without AVX support.
The core assembly algorithms remain unchanged; however, results may vary slightly from Supernova 2.0.1, and some metrics are now measured differently (see below).
- Supernova now estimates the genome size approximately 20% of the way
through the assembly process and exits if the inferred raw coverage
is very far from the recommended range of 38x to 56x. This is done
to avoid long assembly runs at unintentionally low or high coverage.
An option is provided to resume assembly at this point, although this
action is generally not recommended.
- To avoid accidental use of an arbitrary default value for --maxreads, it is now a required argument. The maximum allowed value has been changed to 2.14 billion.
- Add metrics that estimate the GC and dinucleotide content of genomes,
which can be useful for interpreting results.
- The metrics assembly_size, contig_N50, phase_block_N50, scaffold_N50,
scaffold_1kb_plus and scaffold_10kb_plus are now computed in such
a fashion that their values may be reproduced exactly from Supernova
FASTA output. As a result, the reported metrics may vary slightly from metrics generated by the previous version of Supernova.
- The "ploidy histogram" has been removed from the summary.csv file, but is still available in the
- Fix an issue where both alleles at some loci were present in both pseudohap output files.
- Fix an integer overflow error that sometimes occurred in stage
ASSEMBLER_PR after printing "building elocs".
- Fix a crash that sometimes occurred in stage ASSEMBLER_PR after printing "building new assembly".
- Fix a deadlock that occurred when memory allocation failed in a critical block.
- Improve performance at several points in the code where highly repetitive
genomes could cause slow execution.
- supernova mkoutput is now strictly single-threaded. Previously, a
very small portion of the process was multi-threaded, leading to issues
on multi-user systems where a given user may be allocated a restricted
number of cores.
- The publicly available assemblies have been replaced by Supernova 2.1.0 assemblies.
- The source code for Supernova is now released under the MIT license.
- Add new genome metric: ploidy_histogram
- Truncate large metadata files when generating a tarball for upload to 10x, rather than omitting them.
- Fix an issue where supernova mkoutput would emit both reverse-complement and forward versions of the same pseudohap scaffold. It is safe to use 2.0.1 to generate new FASTA files from 2.0.0 assemblies. The new files show “ver=1.10” in the header.
- Fix an issue where unzipped FASTQ files were no longer accepted as input.
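The notes above state that metrics such as scaffold_N50, scaffold_1kb_plus and scaffold_10kb_plus can be reproduced from Supernova FASTA output. As a rough illustration of that kind of check, the sketch below computes an N50 and length-threshold counts from the record lengths of a FASTA stream. It uses the standard N50 definition; Supernova's exact definitions (for example, how contigs are delimited within scaffolds, or which output style is used) are not described in these notes, and all function names here are hypothetical.

```python
import io


def read_fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA text stream."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Standard N50: the length L such that records of length >= L
    together cover at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize(lengths):
    """Collect a few length-based summary values for a set of records."""
    lengths = list(lengths)
    return {
        "total_bases": sum(lengths),
        "n50": n50(lengths),
        "records_1kb_plus": sum(1 for x in lengths if x >= 1000),
        "records_10kb_plus": sum(1 for x in lengths if x >= 10000),
    }


# Tiny in-memory example; a real run would pass an open FASTA file instead.
example = io.StringIO(">a\nACGT\nACGT\n>b\nACG\n>c\nA\n")
print(summarize(read_fasta_lengths(example)))
```

Working from record lengths alone keeps the check independent of any particular FASTA parsing library.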
Failures, Crashes, and Forensics
- Fix a number of failures in ASSEMBLER_M2 (viz. error messages regarding "TrimAdapter") related to unexpected read lengths. We still recommend against trimming or otherwise pre-processing Linked-Read data prior to running Supernova.
- Fix a number of crashes in the ASSEMBLER_ML stages (viz. “remove duplicate edges”, “computing division points” or “translating pairs to matches”) related to runs with very high coverage depth. We still recommend running Supernova with coverage between 38x and 56x for your genome.
- Fix a bug that caused the ASSEMBLER_ACP stage to crash.
- Fix a potential pipeline failure in ASSEMBLER_DF (viz. “Map/Reduce operation has failed at pass 0”) related to users using a different number of cores than we tested. Note that many parts of the Supernova pipeline are not capable of using more than 32 cores.
- Fix a condition where ASSEMBLER_PR could exit prematurely (viz. “unneeded vertices”).
- Fix a potential infinite loop in ASSEMBLER_CL (viz. “identifying redundant edges”).
- Certain failures to memory map files now print extensive diagnostic information.
- Some users who run the Supernova executables from a Lustre filesystem experience exec() failures (viz. “Re-exec to adjust stack size failed”). The software is now more robust and, in case of failure, provides remedial guidance.
- Fix an issue that caused unnecessary virtual memory use in the ASSEMBLER_DF stage. Note that Supernova uses memory-mapped files and therefore needs virtual address space (VMEM) that is generally larger than the maximum resident set size (RSS) of the process.
- Fix an issue that caused ASSEMBLER_PR to run very slowly on certain genomes (viz. “indexing closure paths”).
- Barcode subsampling is now deprecated. This also simplifies the workflow and
reduces the amount of sequencing that is required.
- We provide a new 'optimized salting out' protocol that can be used to easily
prepare DNA from a wide range of sample types and which we demonstrate on single
- Memory usage has increased on average by about 10%. Nevertheless, of
20 test assemblies, 18 ran on a server having 256 GB RAM, and the
remaining 2 ran on a server having 512 GB RAM. The 8 human assemblies
in the set (all at about 56x coverage) ran on a server having 256 GB RAM;
however, it is possible that, for stochastic reasons, some human datasets
may require somewhat more memory.
- The mean run time for Supernova has increased; however, the variance is lower.
Several extreme run time phenotypes are gone.
- Molecule length is now more accurately computed. A plot is now provided showing
the inferred distribution of molecule lengths in comparison to control samples.
This replaces the previous estimation, histogram_molecules.json.
- The k-mer histogram and a PDF plot have been reintroduced.
- Several new metrics about the genome and data (including genome size) are now
- Total wallclock time for assembly is now shown in the text summary file.
- An alert is now issued if the estimated genome size seems too low or too high.
- An alert is now issued if coverage seems too low or too high.
- Alerts are now shown in the text summary file.
- The representation of cycles in FASTA output has been improved.
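These notes repeatedly refer to a recommended raw coverage of 38x to 56x and to a required --maxreads argument capped at 2.14 billion. The arithmetic behind choosing a read count for a target coverage can be sketched as follows. This is only an illustration: it assumes that raw coverage is simply (number of reads x read length) / genome size and that --maxreads counts individual reads. The way Supernova itself infers genome size and coverage, and the thresholds it uses for alerts or early exit, are not described here. All function names are hypothetical.

```python
RECOMMENDED_MIN_X = 38.0   # lower end of the recommended raw coverage range
RECOMMENDED_MAX_X = 56.0   # upper end of the recommended raw coverage range
MAXREADS_LIMIT = 2_140_000_000  # "2.14 billion", the maximum stated in these notes


def raw_coverage(num_reads, read_length, genome_size):
    """Raw coverage, assumed here to be total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_for_coverage(target_x, read_length, genome_size):
    """Number of reads needed to reach target_x coverage, capped at the limit."""
    reads = int(target_x * genome_size / read_length)
    return min(reads, MAXREADS_LIMIT)


def in_recommended_range(coverage_x):
    """True if the coverage lies inside the recommended 38x to 56x range."""
    return RECOMMENDED_MIN_X <= coverage_x <= RECOMMENDED_MAX_X


# Example: a 3.2 Gb genome sequenced with 150 bp reads (both values are assumptions).
genome_size = 3_200_000_000
read_length = 150

print(reads_for_coverage(56, read_length, genome_size))
print(raw_coverage(1_200_000_000, read_length, genome_size))
print(in_recommended_range(raw_coverage(1_000_000_000, read_length, genome_size)))
```

When the genome size is uncertain, it is worth running the same arithmetic for a low and a high size estimate to see how far the resulting coverage could drift from the recommended range.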