# Achieving Success with De Novo Assembly

The Chromium de novo Assembly Solution enables the creation of diploid assemblies of human and nonhuman genomes. A single library of DNA from an individual organism is created and sequenced, and the Supernova software is run to yield a diploid assembly of that organism’s genome.

This is a markedly straightforward and low-cost approach. This document will help you understand the factors involved in being successful with Supernova.

## Advantages of our de novo Assembly Solution

### Creating a genome assembly is easy

For most alternative technologies, assembling a genome can be a long and arduous process that starts with picking a mix of technologies, may involve upstream processes such as inbreeding to even get the sample, and ends with a bioinformatician analyzing the data and often inventing new methods. By contrast, with the possible exception of DNA extraction (for certain sample types), our entire laboratory and computational process is turnkey and cookbook. We provide instructions for you to extract DNA, prepare and generate a single library, and sequence it. Finally, you run the assembly program, called Supernova. There is only one parameter that you supply to Supernova: the number of reads to use, which can be computed from your estimate of the genome size. There are no knobs to turn.

### It only needs a tiny amount of DNA

The required amount (approximately 1 ng, after quanting) is fundamentally enabling in cases where DNA is limited. For example, even for tiny organisms, such as a fruit fly, you can usually get enough DNA from a single individual. In addition to the cost savings, this avoids potential problems arising from practices such as inbreeding or mixing DNA from unrelated individuals.

### It is cost-effective and scalable

Compared to competing technologies, our de novo Assembly Solution costs a fraction as much (see Table 3 in Paajanen et al. 2017). This is because every step costs less. Collecting DNA costs less (see above). You only need one library. Supernova takes short read sequencing data as input, and in fact works well with Illumina’s lowest cost platforms, HiSeq X and NovaSeq. No computational expertise is required to run Supernova, and computational costs are about an order of magnitude lower than for assembly of long reads.

### It yields a high-quality, diploid representation of your genome

Long contigs in very long scaffolds can often be obtained, and contigs generally extend perfectly for 25-30 kb before encountering an error or gap (see results for test datasets). By contrast, long read assemblies often have indel errors interrupting perfect stretches at a higher rate. Supernova produces true diploid assemblies, thus representing both alleles, typically in very long phase blocks. This diploid representation is important, because in order to correctly understand the biology of an individual, it is necessary to observe both alleles!

## Making Supernova work for your genome of interest

Customers have successfully applied Supernova to a wide range of organisms. We characterize the scope of applicability of Supernova directly by providing the results of our current controlled testing, ranging over a large set of vertebrate, plant and insect datasets. Here are critical things you should know:

### Genome size

We have tested Supernova on organisms with genome sizes ranging from 140 Mb to 3.2 Gb. Slightly smaller and slightly larger genomes are also likely to work. Genomes larger than 4 Gb should be considered experimental and are not supported. Use of more than 2.14 billion reads is not supported in the software, which further limits its use on very large genomes.

### Genome ploidy

We have tested Supernova on diploid genomes. Haploid genomes are likely to be fine, but have not been tested. The results for polyploid genomes are unknown and may depend on detailed sample characteristics (e.g. similarity between homeologs). We have tested Supernova on germline samples. We have not tested cancer samples.

### Using our data as guidance

Please closely examine the results of our test datasets to get a sense of how your sample might perform. You can also download our test datasets to conduct your own detailed evaluation. Please note that it is not possible in all cases to predict or guarantee expected results because there are many complex variables, from the molecule length to the sequencing data quality to the genome characteristics and more, that impact the results.

## Guidance to generate the right data

Getting a good assembly depends on factors all through the process. Here are some things to keep in mind before you even start following the laboratory protocols we provide.

### Start from a single individual

Do not mix wild individuals. Nominally clonal populations can be used as a DNA source, with caution, because such populations can retain significant heterozygosity. Assembly of bacterial populations is not supported.

### Long, undamaged DNA is required for high quality assembly

Generating it can be the hardest part of the process. Please see our sample prep protocol recommendations. For certain sample types (e.g. blood and cell lines), very long DNA can be made simply by following the directions. For other sample types, making long DNA is challenging and may require experimentation. Please share your experiences and in particular successful use of methods that would have value for other customers.

DNA length is challenging to quantitate accurately. The most definitive readout is only available after data have been generated and Supernova has been run. This readout represents the state of the DNA after library construction, where some damage may have occurred, depending on the rate of nicking in your DNA and other factors. Supernova measures molecule length against an intermediate assembly, and reports the length-weighted mean that it observes. Generally, Supernova assembly quality improves with DNA length. DNA shorter than 50 kb can be problematic (depending on factors not knowable before assembly); DNA shorter than 20 kb is highly problematic. To provide context, please compare the molecule length value for your sample to those for our test samples.
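To make the term "length-weighted mean" concrete, here is a minimal sketch using the standard definition (each molecule is weighted by its own length). The function names and the example lengths are hypothetical, and this is not Supernova's actual calculation, which the text does not specify. The 20 kb and 50 kb thresholds are the ones quoted above.

```python
def length_weighted_mean(lengths_kb):
    """Standard length-weighted mean: sum(L^2) / sum(L)."""
    total = sum(lengths_kb)
    if total <= 0:
        raise ValueError("need at least one molecule with positive length")
    return sum(length * length for length in lengths_kb) / total


def classify_molecule_length(lw_mean_kb):
    """Map a length-weighted mean (kb) to the guidance in the text."""
    if lw_mean_kb < 20:
        return "highly problematic"
    if lw_mean_kb < 50:
        return "can be problematic"
    return "no length warning"


# Hypothetical molecule lengths in kb, for illustration only.
example_lengths_kb = [10, 20, 70, 100]
lw_mean = length_weighted_mean(example_lengths_kb)
plain_mean = sum(example_lengths_kb) / len(example_lengths_kb)

print(f"plain mean: {plain_mean:.1f} kb")
print(f"length-weighted mean: {lw_mean:.1f} kb")
print(classify_molecule_length(lw_mean))
```

Because long molecules count for more, the length-weighted mean is at least as large as the plain mean, so the two should not be compared directly.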

### Follow the Chromium Genome laboratory instructions

This library construction is a defined process that is known to work.

| Genome size (Gb) | DNA to load (ng) |
| --- | --- |
| < 1.6 | 0.6 |
| 1.6 - 3.2 | Interpolate between 0.6 and 1.2 |
| 3.2 - 4.0 | 1.2 |
| 4.0+ | Not supported. Load 1.2 if you want to experiment |
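The loading values above can be sketched as a small lookup function. This is an illustration under stated assumptions: genome sizes are taken to be in Gb and loading amounts in ng (inferred from the rest of this document), the interpolation between 1.6 and 3.2 is assumed to be linear, and the function names are hypothetical.

```python
def loading_mass_ng(genome_size_gb):
    """Return the DNA amount to load (assumed ng) for a genome size (assumed Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    if genome_size_gb < 1.6:
        return 0.6
    if genome_size_gb <= 3.2:
        # Assumption: linear interpolation between (1.6, 0.6) and (3.2, 1.2).
        fraction = (genome_size_gb - 1.6) / (3.2 - 1.6)
        return 0.6 + fraction * (1.2 - 0.6)
    # 3.2 to 4.0 uses 1.2; above 4.0 is unsupported, but 1.2 is the
    # suggested amount if you want to experiment.
    return 1.2


def is_supported_size(genome_size_gb):
    """Genomes above 4.0 (assumed Gb) are listed as not supported."""
    return 0 < genome_size_gb <= 4.0


for size_gb in (0.14, 2.4, 3.5, 5.0):
    status = "supported" if is_supported_size(size_gb) else "not supported"
    print(f"{size_gb} Gb -> load {loading_mass_ng(size_gb):.2f} ng ({status})")
```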

We optimized Supernova for 150-base reads because longer reads are much more expensive to generate. Use of longer reads is experimental and untested by us. Use of shorter reads is strongly discouraged because they have less power.

### Sequence to ~56x raw coverage

To determine the number of reads to be generated, take the estimated haploid genome size in bases, multiply by 56, and divide by 150. In some cases it can be advantageous to moderately increase coverage, but increasing coverage can also degrade assembly quality. Conversely, coverage can be reduced by about ⅓, corresponding to ~38x coverage, if required by your budget, with some degradation in assembly quality. Finally, note that use of more than 2^31 - 1 (about two billion) reads is not supported at present.
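The read-count arithmetic above can be sketched as follows. The function name and the rounding up to a whole read are assumptions made for illustration; the 56x and ~38x coverage values, the division by 150, and the 2^31 - 1 limit come from the text.

```python
READ_LENGTH_BASES = 150      # divisor given in the text
MAX_READS = 2**31 - 1        # 2,147,483,647, the stated upper limit


def reads_needed(genome_size_bases, coverage=56):
    """Reads required for the given raw coverage of a haploid genome size."""
    if genome_size_bases <= 0 or coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    # Integer ceiling division, so the result is rounded up to a whole read.
    total_bases = genome_size_bases * coverage
    n_reads = (total_bases + READ_LENGTH_BASES - 1) // READ_LENGTH_BASES
    if n_reads > MAX_READS:
        raise ValueError(f"{n_reads} reads exceeds the limit of {MAX_READS}")
    return n_reads


human_sized = 3_200_000_000  # 3.2 Gb, the largest tested size
print(reads_needed(human_sized))               # 56x raw coverage
print(reads_needed(human_sized, coverage=38))  # reduced-budget option
print(reads_needed(140_000_000))               # 140 Mb, the smallest tested size
```

Under these numbers, the read limit alone caps a 56x run at a haploid genome size of roughly 5.75 Gb (2^31 - 1 reads × 150 bases ÷ 56).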

### Use a tested sequencing instrument model

Our testing tables show results for three Illumina instrument models: HiSeq X, NovaSeq and HiSeq 2500 (rapid run). All of these work well (although instrument to instrument and run to run variability can be large). Many customers have used the HiSeq 4000, but assemblies generated using it have significantly lower contiguity. MiSeq should work but we have not tested it. Other instrument models have not been tested and are not recommended.

## How to get help

1. Create a support ticket by emailing [email protected]. In that email, please:

• Attach the ‘stats’ folder that Supernova generated
• Tell us what organism you sequenced
• Include information and estimates about genome characteristics, including ploidy, heterozygosity, and repeat content
• Describe the source material (e.g. one individual)
• Describe the DNA preparation method
• Include a gel image and the expected molecule length
• List the QC methods used
• State the quantity of DNA used

More generally, please let us know about your experiences with Supernova. We are very interested in what works for you and what does not, and things you've learned that might help other customers.

## Supernova publication

Please see Weisenfeld et al. 2017. Source code is available for interested customers, but we are not staffed to answer questions about it or provide support for modifications.

## Bibliography

1. Mohr, WM, et al. Apr. 2017. Improved de novo Genome Assembly: Linked-Read Sequencing Combined with Optical Mapping Produce a High Quality Mammalian Genome at Relatively Low Cost. bioRxiv.
Uses Supernova 1.1 to obtain an assembly of the monk seal having scaffolds of size 22 Mb, then adds other data.

2. Weisenfeld NI, et al. May 2017. Direct determination of diploid genome sequences. Genome Res 27, 757-767.
This paper introduces Supernova 1.1 and tests it on seven human genomes.

3. Greer SU, et al. June 2017. Linked read sequencing resolves complex genomic rearrangements in gastric cancer metastases. Genome Med 9, 57.

"We sought to determine whether we could resolve and thus validate the rearranged structure by de novo assembly. We extracted all the sequence reads containing SV-specific barcodes from the linked read fastq files and then used these subset fastq files as input to the Supernova de novo assembly program to generate contig sequences... We visualized the structures of the resulting contigs by plotting the mapping position of each SV-specific read in the genome versus its mapping position in the contig."

4. Armstrong EE, et al. Sep. 2017. Entering the era of conservation genomics: Cost-effective assembly of the African wild dog genome using linked long reads. bioRxiv.

Assembles three African wild dogs using Supernova 1.1.

5. Paajanen P, et al. Oct. 2017. A critical comparison of technologies for a plant genome sequencing project. bioRxiv.
Evaluates several methods for sequencing and assembly of a potato genome, including Supernova 1.1.

"The 10x Genomics based assembly using SUPERNOVA was as easy to obtain as the DISCOVAR assembly. The two most remarkable features of this assembly are the low cost and input DNA requirement: for only slightly higher cost than a DISCOVAR assembly, and considerably less than with only one long mate-pair library, we obtained an assembly comparable to what one would expect from multiple long mate-pair libraries."

6. Aleman F. Oct. 2017. The Necessity of Diploid Genome Sequencing to Unravel the Genetic Component of Complex Phenotypes. Front Genet 8, 148.

An argument for diploid assembly.

7. Univ. of Oregon. Nov. 2017. Piliocolobus tephrosceles (Ugandan red Colobus). NCBI assembly archive.
Supernova 1.2 assembly having N50 scaffold size of 10 Mb.

8. China Agricultural Univ. Nov. 2017. Anas platyrhynchos platyrhynchos (common mallard). NCBI assembly archive.
Hybrid assembly including Supernova.

9. Jones SJM, et al. Dec. 2017. The Genome of the Beluga Whale (Delphinapterus leucas). Genes 8, 378.

Builds a Supernova 1.1 assembly having scaffolds of N50 size 16.8 Mb, then adds additional data.

10. USDA-ARS Center for Grain and Animal Health Research. Dec. 2017. Melanaphis sacchari (aphids). NCBI assembly archive.
Hybrid assembly including Supernova.

11. BGI. Dec. 2017. Physeter catodon (sperm whale). NCBI assembly archive.
Hybrid assembly including Supernova.

12. Jones SJ, et al. Dec. 2017. The Genome of the Northern Sea Otter (Enhydra lutris kenyoni). Genes 8, 379.

Uses Supernova 1.1 to obtain an assembly having scaffolds of N50 size 21 Mb, then adds other data.

13. DNAnexus. Dec. 2017. PacBio de novo genome assembly.

Shows a computational cost estimate of $5,000 to $10,000 for assembling a human genome using PacBio.

14. Fallon RF, et al. Dec. 2017. Firefly genomes illuminate the origin and evolution of bioluminescence. bioRxiv.

Includes a Supernova 1.1 assembly of data from a single male firefly, Ignelater luminosus, collected in Puerto Rico.

15. BGI. Dec. 2017. Physeter catodon (sperm whale).

Assembles BGISEQ-500 data using SOAPdenovo + ARCS + Supernova.

16. Hulse-Kemp AM, et al. Jan. 2018. Reference Quality Assembly of the 3.5 Gb genome of Capsicum annuum from a Single Linked-Read Library. Hort Res 5.

Assembles the chili pepper genome using Supernova 1.1.

17. Torres MF, et al. Jan. 2018. Genus-wide sequencing supports a two-locus model for sex-determination in Phoenix. bioRxiv.

Exploits a Supernova assembly as part of their investigation of sex determination in date palms.

18. McDonnell Genome Institute - Washington Univ. School of Medicine. Feb. 2018. Terrapene mexicana triunguis (Three-toed box turtle). NCBI assembly archive.

Supernova 1.2.2 assembly of this turtle.

19. USDA-ARS. Feb. 2018. Vanessa tameamea (Kamehameha butterfly). NCBI assembly archive.

Assembles this butterfly using Supernova 1.2.0.

20. National Institutes of Natural Sciences. Apr. 2018. Macaca fuscata fuscata (Japanese macaque). NCBI assembly archive.

Assembles this genome using Supernova 2.0.0.

21. Univ. of Pennsylvania. Apr. 2018. Temnothorax curvispinosus. NCBI assembly archive.

Assembles the genome of this common forest ant species from the eastern USA using Supernova 2.0.1.

22. Cougar MB, et al. May 2018. A high quality genome for Mus spicilegus, a close relative of house mice with unique social and ecological adaptations. G3 8.

Assembles the genome of this mouse with Supernova 1.1.5.

23. Perera OP, et al. May 2018. CRISPR/Cas9 mediated high efficiency knockout of the eye color gene Vermillion in Helicoverpa zea (Boddie). PLoS One 13.

Uses a Supernova 1.1.5 assembly as part of studying genome editing in the bollworm.

24. Mostovoy Y. May 2018. [assemblies of NA19440, NA19068, HG03115, NA21125, NA20587, NA18552, HG03838, HG00851, HG01971, HG00353, HG00250]. NCBI assembly archive.

Assemblies of eleven human cell line genomes using Supernova 1.1.1 and Bionano.

25. Ando T, et al. June 2018. Repeated inversions at the pannier intron drive diversification of intraspecific colour patterns of ladybird beetles. bioRxiv.

Assembles four ladybird beetle samples using Supernova 2.0.0, and fills gaps with PacBio reads.