10x Genomics
Chromium De Novo Assembly

Supernova 2.1, printed on 04/05/2025

Assembly Process

Analysis software for 10x Genomics linked read products is no longer supported. Raw data processing pipelines and visualization tools are available for download and can be used for analyzing legacy data from 10x Genomics kits in accordance with our end user licensing agreement without support.

Supernova generates phased, whole-genome de novo assemblies from a Chromium-prepared library.

Please see Achieving Success with De Novo Assembly and System Requirements before creating your Chromium libraries for assembly.

Supernova should be run using 38-56x coverage of the genome.
• Somewhat higher coverage is sometimes advantageous.
• Supernova will exit if it finds that coverage is far from the recommended range.
• Note that at most 2.14 billion reads are allowed.
• Please note that we have not extensively tested genomes larger than human, and any genome above approximately 4 Gb should be considered experimental and is not supported.

Running Supernova has the following steps:

1. Run supernova mkfastq on the Illumina BCL output folder to generate FASTQ files.
2. Run supernova run separately for each sample to generate a whole genome de novo assembly for each.
3. Run supernova mkoutput in order to generate various styles of FASTA output for your assemblies.

For the following example, assume that the Illumina BCL output is in a folder named /sequencing/140101_D00123_0111_AHAWT7ADXX.

Run supernova mkfastq

First, follow the instructions on running supernova mkfastq to generate FASTQ files.

Set up the supernova command for de novo assembly

To run Supernova, use the supernova run command with the parameters listed below. For help on which arguments to use to target a particular set of FASTQs, consult Running 10x Pipelines on FASTQ Files.

Argument	Description
`--id`	A unique run ID string: e.g. `sample345`
`--fastqs`	Path of the FASTQ folder generated by `supernova mkfastq` e.g. `/home/jdoe/runs/HAWT7ADXX/outs/fastq_path`
`--sample`	(optional) Can be used to select only a single sample of those specified in the sample sheet supplied to `mkfastq`. By default, all samples are used.
`--description`	(optional) Description of the data set. This will be included, along with the run ID string, in various output files.
`--maxreads`	Target using approximately this number of reads. To calculate the number of reads that you need, first start with an estimate of the genome size. If you don't know, make a guess. Supernova will estimate the genome size for you, and if you are far off, you can restart the assembly process. Next, set the number of reads so as to achieve 56x raw coverage. This is (genome size) x 56 / 150, assuming that your reads are 150 bases long. Coverage as low as 38x is acceptable, with some degradation in quality. Coverage significantly greater than 56x can sometimes help but can also be deleterious, depending on the dataset. The maximum allowed value for `--maxreads` is 2140000000 (2.14 billion). If you specify more reads than are available, Supernova will simply use all of your reads. If you specify fewer reads than are available, Supernova will uniformly, randomly choose reads from your input dataset. This pseudo-random downsampling is performed such that precisely the same input, on the same version of Supernova, should choose the same subset of reads. You may specify 'all' rather than a specific number, which will cause all reads to be used. Be careful not to exceed ~56x coverage or 2.14 billion reads.
`--accept-extreme-coverage`	(optional) Used to override Supernova’s Extreme Coverage Testing [see below]. Not recommended. This may cause supernova to run for a long time or crash.
`--localcores`	(optional) limits concurrent sections of Supernova to use the specified number of cores.
`--localmem`	(optional) limits memory use on shared systems where Supernova may attempt to use more resources than a user is allowed. Note that this is not a hard limit, but is used as a hint for high-memory portions of the assembly process that deliberately scale to the amount of memory installed in a system.
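
As a rough illustration of how these arguments combine, the shell sketch below assembles a supernova run command line and prints it rather than executing it. The helper name build_supernova_cmd is hypothetical, and so are the example values for --localcores and --localmem. In particular, the table above does not state the units of --localmem; gigabytes are assumed here and should be confirmed with supernova run --help.

```shell
# Hypothetical helper: print (do not run) a supernova run command line.
# Usage: build_supernova_cmd ID FASTQ_DIR MAXREADS [CORES] [MEM]
# The body runs in a subshell so its variables stay local.
build_supernova_cmd() (
    id=$1
    fastqs=$2
    maxreads=$3
    cores=${4:-}
    mem=${5:-}
    cmd="supernova run --id=$id --fastqs=$fastqs --maxreads=$maxreads"
    # Optional resource limits; --localmem units are assumed to be GB.
    if [ -n "$cores" ]; then
        cmd="$cmd --localcores=$cores"
    fi
    if [ -n "$mem" ]; then
        cmd="$cmd --localmem=$mem"
    fi
    printf '%s\n' "$cmd"
)

build_supernova_cmd sample345 /home/jdoe/runs/HAWT7ADXX/outs/fastq_path 298666666 16 256
```

Printing the command first makes it easy to review the arguments before committing to a long-running assembly.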

The following options are deprecated:

Argument	Description
`--indices`	[deprecated and optional; demux only] Sample indices associated with this sample, for use only with the older `supernova demux` data preparation step. Comma-separated list of either index set plate wells (e.g. `SI-GA-A1,SI-GA-H12`) or index sequences (e.g. `TCGCCATA,GTATACAC`).
`--lanes`	[deprecated and optional; demux only] Lanes associated with this sample. For use only with the older, `supernova demux` data preparation step.
`--bcfrac`	[deprecated and optional] Fraction of barcodes in the sample to use. This was intended to aid in the assembly of small genomes, but is no longer needed and may be harmful. Randomly chooses the specified fraction of all barcodes and retains only reads belonging to the chosen barcodes. Unbarcoded reads are selected randomly at the same rate. If `--maxreads` is set to something other than all, the data are examined after barcode subsampling and reads are randomly chosen to achieve the desired number.

Examples of how to determine the value for --maxreads

1. Suppose you estimate the size of your sample genome to be 800 Mb, and suppose your reads have length 150 bases (as recommended). Then 56x coverage would be
800,000,000 * 56 / 150 = 298,666,666 reads
so you could set --maxreads=298666666. If you have fewer reads than this, Supernova will use all of the reads that are available.

2. Now suppose your genome has size 8000 Mb. Then 56x coverage would be
8,000,000,000 * 56 / 150 = 2,986,666,666 reads
but Supernova only allows 2.14 billion reads, so you could set --maxreads=2140000000, which would give you about 40x coverage, just barely within the recommended range. Note, however, that Supernova has only been tested on genomes of size up to about 3200 Mb. Note also that if you set --maxreads=all and have more than 2,140,000,000 reads, Supernova will exit after the reads have been read in and counted.

3. Finally suppose your genome has size 80 Mb. Then 56x coverage would be
80,000,000 * 56 / 150 = 29,866,666 reads
so you could set --maxreads=29866666. Note that the smallest genome that supernova has been tested on is ~140 Mb.
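
The three calculations above can be reproduced with a small shell function. This is only a sketch: the name maxreads_for is hypothetical, the defaults (56x coverage, 150-base reads) and the 2,140,000,000 cap are taken from this page, and it assumes a shell with 64-bit integer arithmetic such as bash.

```shell
# Hypothetical helper: number of reads needed for a target raw coverage,
# capped at the maximum that --maxreads accepts.
# Usage: maxreads_for GENOME_SIZE_BASES [COVERAGE] [READ_LENGTH]
# The body runs in a subshell so its variables stay local.
maxreads_for() (
    genome=$1
    coverage=${2:-56}
    readlen=${3:-150}
    limit=2140000000   # maximum allowed value for --maxreads
    # Integer division truncates, matching the rounded-down figures above.
    reads=$(( genome * coverage / readlen ))
    if [ "$reads" -gt "$limit" ]; then
        reads=$limit
    fi
    echo "$reads"
)

maxreads_for 800000000     # example 1: 298666666
maxreads_for 8000000000    # example 2: 2140000000 (capped)
maxreads_for 80000000      # example 3: 29866666
```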

Run supernova run

After determining these input arguments, call supernova run:

$ cd /home/jdoe/runs
$ supernova run --id=sample345 \
                --fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path

Note that Supernova has been designed for stand-alone operation on a single, large system. Portions of the assembly process might scale to use all of the installed memory on a system. If you need to limit memory use by Supernova, e.g. on a shared system, please see the --localmem command line option. Likewise, parallel sections of code will use all cores on a system and this behavior can be limited with --localcores.

Following a set of preflight checks to validate input arguments, Supernova pipeline stages will begin to run:

supernova run
Copyright (c) 2016 10x Genomics, Inc.  All rights reserved.
-----------------------------------------------------------------------------
Martian Runtime - v2.3.3
 

Running preflight checks (please wait)...
2016-01-01 00:00:08 [runtime] (ready)           ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PREFLIGHT
2016-01-01 00:00:08 [perform] Serializing pipestance performance data.
2016-01-01 00:00:01 [runtime] (split_complete)  ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PREFLIGHT
2016-01-01 00:00:01 [runtime] (run:local)       ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PREFLIGHT.fork0.chnk0.main
2016-01-01 00:00:07 [runtime] (chunks_complete) ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PREFLIGHT
2016-01-01 00:00:10 [runtime] (join_complete)   ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PREFLIGHT
2016-01-01 00:00:11 [runtime] (ready)           ID.sample345.ASSEMBLER_CS._ASSEMBLER._ASSEMBLER_PREP._FASTQ_PREP_NEW.SETUP_CHUNKS
...

supernova run will use all of the sequence data available in the FASTQ folder, up to the limit imposed by the --maxreads option, described above.

If you are processing data prepared with the older, deprecated supernova demux process, you can also specify --indices and --lanes to further select the data to be processed. For new datasets, this selection is performed in the samplesheet provided to supernova mkfastq.

supernova run assumes that all of the cores on your system are available for its use, but you can use the --localcores option to limit this. Similarly, supernova run assumes that all of the memory on your system is available for its use. You can use --localmem to suggest limits; however, memory utilization in certain sections of the code scales with the size of the genome, the number of input reads, and the quality of the data, and may exceed this limit.

The pipeline will create a new folder named with the sample ID you specified (e.g. /home/jdoe/runs/sample345) for its output. If this folder already exists, supernova run will assume it is an existing pipestance and attempt to resume running it.
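
Because an existing folder changes what supernova run does, it can be worth checking for one before launching. The sketch below is a hypothetical pre-run check (check_run_dir is not part of Supernova); it only reports whether a folder with the run ID already exists in the given parent directory.

```shell
# Hypothetical pre-run check: report whether a run ID would start a new
# pipestance or resume an existing one.
# Usage: check_run_dir RUN_ID [PARENT_DIR]
# The body runs in a subshell so its variables stay local.
check_run_dir() (
    id=$1
    parent=${2:-.}
    if [ -d "$parent/$id" ]; then
        echo "resume: $parent/$id exists and will be treated as an existing pipestance"
    else
        echo "new: $parent/$id will be created"
    fi
)

check_run_dir sample345 /home/jdoe/runs
```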

Watching Supernova Progress

The standard output from supernova run displays lines that indicate the progress through pipeline stages as shown in Map of the Pipeline. The standard output will pause during individual stages with a message such as:

...
2016-01-03 00:00:01 [runtime] (run:local)       ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_DF.fork0.chnk0.main

and may appear to have stalled. However, you should see a heartbeat message every 6 minutes, such as:

2016-01-03 00:06:01 [runtime] (update)          ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_DF.fork0 chunks_running

If you wish to monitor the progress of one of these stages, you can view the stage-specific standard output, e.g.:

$ cd /home/jdoe/runs
$ tail sample345/ASSEMBLER_CS/_ASSEMBLER/ASSEMBLER_DF/fork0/chnk0/_stdout

and likewise for other ASSEMBLER stages.
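
To avoid typing each stage path by hand, a sketch along the following lines can locate every per-chunk _stdout file under a run folder and show its last lines. Both function names are hypothetical, and the code assumes that all stages follow the fork*/chnk*/_stdout layout seen in the example above.

```shell
# Hypothetical helper: list per-chunk _stdout files under a pipestance folder.
# Assumes each stage writes to .../forkN/chnkN/_stdout as in the example above.
stage_stdout_files() (
    run_dir=$1
    find "$run_dir" -type f -name _stdout -path '*/chnk*/*' | sort
)

# Print the last few lines (default 5) of each stage log, with a header per file.
# Usage: show_stage_progress RUN_DIR [NUM_LINES]
show_stage_progress() (
    stage_stdout_files "$1" | while IFS= read -r f; do
        echo "== $f"
        tail -n "${2:-5}" "$f"
    done
)
```

For the run in this example, show_stage_progress /home/jdoe/runs/sample345 would print a short excerpt from each stage that has started writing output.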

Extreme Coverage Testing

Roughly 20% of the way through the assembly process, Supernova estimates the raw coverage of the genome, and exits if the coverage is not between 30x and 85x. The reasons for this test are that:

• Very low or very high coverage is likely to yield suboptimal results, and may cause Supernova to run unusually long or crash.
• The actual recommended coverage is between 38x and 56x, although somewhat higher coverage is sometimes helpful. Thus the test should only catch cases that are significantly out of range.
• It is possible to accidentally provide Supernova with an inappropriate number of input reads, in which case the mistake may be caught here, saving time.

If the coverage test fails, you have two options:

• Restart using the --maxreads option, providing a lower value than you originally specified, matching the appropriate level of coverage. This is the recommended action.
• Override the test by restarting with the option --accept-extreme-coverage. This is not recommended. If you do this, the assembly will continue at the point where it left off.
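
Before starting a run, a rough check against the thresholds described above can be made from the read count and an estimated genome size. The sketch below is not Supernova's own test: coverage_status is a hypothetical helper, it relies on your genome size estimate rather than the one Supernova computes, and it assumes 150-base reads unless told otherwise.

```shell
# Hypothetical helper: classify estimated raw coverage against the 30x-85x
# test limits and the 38x-56x recommended range.
# Usage: coverage_status NUM_READS GENOME_SIZE_BASES [READ_LENGTH]
# The body runs in a subshell so its variables stay local.
coverage_status() (
    reads=$1
    genome=$2
    readlen=${3:-150}
    cov=$(( reads * readlen / genome ))   # integer division, truncated
    if [ "$cov" -lt 30 ] || [ "$cov" -gt 85 ]; then
        echo "${cov}x: extreme, the coverage test would be expected to fail"
    elif [ "$cov" -lt 38 ] || [ "$cov" -gt 56 ]; then
        echo "${cov}x: passes the test, but outside the recommended 38-56x range"
    else
        echo "${cov}x: within the recommended range"
    fi
)

coverage_status 1000000000 800000000   # far too many reads for an 800 Mb genome
```

Because the arithmetic is truncated to whole numbers, values that fall very close to a threshold should be checked by hand.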

Output Files

A successful supernova run execution should conclude with a message that looks similar to this:

...
2016-01-03 00:00:01 [runtime] (run:local)       ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PR.fork0.join
2016-01-03 00:00:01 [runtime] (chunks_complete) ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PR
2016-01-03 00:00:03 [runtime] (join_complete)   ID.sample345.ASSEMBLER_CS._ASSEMBLER.ASSEMBLER_PR
 
Outputs:
- Run summary:        /home/jdoe/runs/sample345/outs/summary.csv
- Run report:         /home/jdoe/runs/sample345/outs/report.txt
- Raw assembly files: /home/jdoe/runs/sample345/outs/assembly
 
Pipestance completed successfully!
Saving pipestance info to sample345/sample345.mri.tgz

The output of the pipeline will be contained in a folder named with the sample ID you specified (e.g. sample345). The subfolder named outs will contain the main pipeline output files that are described in more detail in Output Overview.
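
As a quick sanity check after a run, the outputs named in the completion message can be verified from the shell. The function check_outs below is a hypothetical sketch; it only tests for the three paths listed in the message above and says nothing about their contents.

```shell
# Hypothetical post-run check: confirm that the outputs named in the
# completion message exist under RUN_DIR/outs.
# Usage: check_outs RUN_DIR   (returns non-zero if anything is missing)
# The body runs in a subshell so its variables stay local.
check_outs() (
    outs=$1/outs
    status=0
    for f in summary.csv report.txt; do
        if [ ! -f "$outs/$f" ]; then
            echo "missing file: $outs/$f"
            status=1
        fi
    done
    if [ ! -d "$outs/assembly" ]; then
        echo "missing directory: $outs/assembly"
        status=1
    fi
    return "$status"
)
```

For the example run, check_outs /home/jdoe/runs/sample345 prints nothing and returns success when all three outputs are present.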

Run supernova mkoutput for FASTA output

First, familiarize yourself with the representation of a genome assembly as a graph structure. Next, follow the instructions on running supernova mkoutput to generate FASTA files.

Frequently asked questions about running Supernova

Why did you deprecate the barcode subsampling option? Two things changed. We generated data from several new small genomes, and made major changes to the Supernova algorithm. Contrary to our expectation, the new algorithm yielded in aggregate somewhat worse results when barcode subsampling was applied. In addition, using barcode subsampling wastes data and is confusing. So we no longer recommend its use.
Can I change the value of K? Supernova uses multiple heuristics to define minimal overlaps between reads, not just a single value, although the assembly graph is represented using a single fixed K value of 48. Changing just one of these heuristics is unlikely to be effective, and might cause Supernova to crash. For most cases where Supernova results do not meet the needs of a given project, deeper changes would be needed to boost performance. These could include improvements to the input data (for example, molecule length), and could include changes to the algorithm architecture.
Should I trim reads before running Supernova? We recommend against trimming reads. It complicates the process, and no advantage has been demonstrated. However, we have not carried out controlled experiments, and it is conceivable that trimming reads could be advantageous in some cases. If you have an example where trimming greatly improves assembly quality, please share it with us.
