10x Genomics
Chromium De Novo Assembly

Generating Output

Once your assembly has completed (yielding binary data structures), use the command supernova mkoutput to generate FASTA output representing your assembly. Each option is described in full after the usage example.

supernova mkoutput \
        --style=raw|megabubbles|pseudohap|pseudohap2 \
        --asmdir=/path/to/outs/assembly \
        --outprefix=output_filename_prefix \
        [ --minsize=N ] \
        [ --headers=short|full ]
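
When the command is launched from a script, it can help to validate the options before calling it. The following is a minimal Python sketch that only assembles the argument list from the flags shown above; the function name build_mkoutput_cmd is hypothetical and not part of Supernova, and nothing is executed.

```python
import shlex

# Option values taken from the usage example above.
VALID_STYLES = {"raw", "megabubbles", "pseudohap", "pseudohap2"}
VALID_HEADERS = {"short", "full"}


def build_mkoutput_cmd(style, asmdir, outprefix, minsize=None, headers=None):
    """Return the argv list for a `supernova mkoutput` call (hypothetical helper).

    The command is only assembled here, never run.
    """
    if style not in VALID_STYLES:
        raise ValueError(f"unknown style: {style!r}")
    if headers is not None and headers not in VALID_HEADERS:
        raise ValueError(f"unknown headers mode: {headers!r}")
    cmd = [
        "supernova", "mkoutput",
        f"--style={style}",
        f"--asmdir={asmdir}",
        f"--outprefix={outprefix}",
    ]
    # Optional flags are appended only when requested.
    if minsize is not None:
        cmd.append(f"--minsize={int(minsize)}")
    if headers is not None:
        cmd.append(f"--headers={headers}")
    return cmd


if __name__ == "__main__":
    cmd = build_mkoutput_cmd(
        "pseudohap", "/path/to/outs/assembly", "output_filename_prefix", minsize=1000
    )
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where supernova is on the PATH.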

Required Style Option

There are four styles of FASTA output:

--style=raw

The raw style, identified in FASTA records as style=1, represents every edge in the assembly as a FASTA record (seen as red segments in the above illustration). These include microbubble arms and gaps. Each gap is represented by a record consisting of a run of Ns roughly approximating the size of the gap: gaps captured by read pairs are represented by 100 Ns; other gaps are longer. Bubbles and gaps generally appear once per 10-20 kb, and the raw graph records are roughly two orders of magnitude shorter than megabubble arms. The raw graph contains both forward and reverse complement edges, and both are written as separate records to the FASTA file.

This is the most detailed of the output styles, in that it is the only one that includes reverse complement edges, does not flatten bubbles, and does not give special treatment to cycles. It is also the only style in which gaps remain as their own distinct records.
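
Because gaps are separate records in the raw style, they can be told apart from sequence edges by checking whether a record consists only of Ns. The sketch below assumes an uncompressed FASTA read as an iterable of lines; read_fasta, is_gap_record and summarize_gaps are illustrative names, not Supernova functions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_gap_record(seq):
    """True if the record is non-empty and made up entirely of Ns."""
    return len(seq) > 0 and set(seq.upper()) == {"N"}


def summarize_gaps(lines):
    """Return (number of sequence records, list of gap record lengths)."""
    n_seq, gap_lengths = 0, []
    for _, seq in read_fasta(lines):
        if is_gap_record(seq):
            gap_lengths.append(len(seq))
        else:
            n_seq += 1
    return n_seq, gap_lengths
```

For a file on disk, pass the open file handle, for example summarize_gaps(open("z.fasta")). Gap records of length 100 would then correspond to the gaps captured by read pairs described above.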

--style=megabubbles

In this style, identified in FASTA records as style=2, each megabubble arm corresponds to a FASTA record, as does each intervening sequence.

Bubbles are flattened by selecting the branch having highest coverage. Gaps are joined to adjacent sequences, resulting in longer edges that represent the gaps internally by sequences of Ns. Reverse complement edges are not represented.
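
Since gaps in this style sit inside longer records rather than in records of their own, locating them means scanning each sequence for runs of Ns. A minimal sketch, with find_gaps as a hypothetical helper name:

```python
import re


def find_gaps(seq):
    """Return (start, length) for each run of Ns; start is 0-based."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"N+", seq.upper())]
```

The same helper applies to the pseudohap styles, which also represent gaps internally.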

--style=pseudohap

The pseudohap style, identified in FASTA records as style=3, generates a single record per scaffold. For example, the seven red edges on top of the illustration for --style=megabubbles, corresponding to seven FASTA records, are combined into a single FASTA record. Megabubble arms are chosen arbitrarily, so many records will mix maternal and paternal alleles.

As in the megabubble graph, microbubbles are flattened, gaps are joined to adjacent sequences, and reverse complement edges are not represented.

--style=pseudohap2

This style, identified in FASTA records as style=4, is like the pseudohap option, except that for each scaffold, two ‘parallel’ pseudohaplotypes are created and placed in separate FASTA files. Records in these files are parallel to each other.
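
One way to use the parallel files is to walk them record by record and flag the scaffolds where the two pseudohaplotypes differ. The sketch below assumes that "parallel" means the Nth record of one file pairs with the Nth record of the other, which should be confirmed on real output; parse_fasta_text and compare_pseudohaps are illustrative names, and the parser is deliberately simplistic (it assumes ">" never appears inside a header).

```python
from itertools import zip_longest


def parse_fasta_text(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    records = []
    for block in text.split(">")[1:]:
        header, _, body = block.partition("\n")
        # Join wrapped sequence lines by removing all whitespace.
        records.append((header.strip(), "".join(body.split())))
    return records


def compare_pseudohaps(text1, text2):
    """Yield (index, header1, header2, identical) for each record pair."""
    pairs = zip_longest(parse_fasta_text(text1), parse_fasta_text(text2))
    for i, (r1, r2) in enumerate(pairs):
        if r1 is None or r2 is None:
            raise ValueError("files contain different numbers of records")
        yield i, r1[0], r2[0], r1[1] == r2[1]
```

The two arguments would typically be the contents of the two output files, read with open(path).read().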

Required Non-Style Options

--asmdir=PATH/TO/outs/assembly

This should be set to the path of the assembly output directory created by Supernova, that is, the outs/assembly directory underneath your pipeline output directory.

--outprefix=OUTPUT_FILE_PREFIX

This is a filename prefix for assembly output. It can include a relative or absolute path. For instance, specifying --outprefix=/x/y/z will create a FASTA file in the directory /x/y called z.fasta. Note that for the pseudohap2 option, which creates two files, these would be called z.1.fasta and z.2.fasta.
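
A downstream script can derive the expected file names from the prefix and style. This is a minimal sketch that follows only the naming described above; expected_outputs is a hypothetical helper, and the names should be checked against the files actually written by your Supernova version.

```python
def expected_outputs(outprefix, style):
    """Return the FASTA file names expected for a given prefix and style."""
    if style == "pseudohap2":
        # Two parallel pseudohaplotype files.
        return [f"{outprefix}.1.fasta", f"{outprefix}.2.fasta"]
    return [f"{outprefix}.fasta"]
```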

Optional Parameters

--minsize=N

For output styles other than raw, you may choose to print only those FASTA records longer than a given size. In raw mode all FASTA records are printed.

--minsize=n [specify minimum FASTA record size in bases, default: 1000]
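
The same kind of filter can be applied after the fact, for example to raw output, where --minsize has no effect. The sketch below works on (header, sequence) pairs and keeps records of at least minsize bases; whether Supernova itself treats the threshold as inclusive is an assumption here, and filter_by_length is a hypothetical name.

```python
def filter_by_length(records, minsize=1000):
    """Yield only the (header, sequence) pairs with at least minsize bases.

    The inclusive comparison is an assumption, not confirmed behavior.
    """
    for header, seq in records:
        if len(seq) >= minsize:
            yield header, seq
```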

--headers=<mode>

By default, FASTA header lines show only the start and end edges of the path associated with the record. Optionally, every edge id in the path may be displayed. This can yield very long header lines, which may break other software.

Mode              Behavior
--headers=full    Verbose output; all edge ids are written.
--headers=short   Only the first and last edge ids are shown; this is the default.

A --headers=short record might look like this:

>55 edges=4..6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT

The same record with --headers=full might have the following header line:

>55 edges=4,15,33,7,6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT
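
Both header forms can be read with the same small parser. The sketch below is based only on the two example headers above and assumes every field after the record id has the form key=value; parse_header is an illustrative name, and other Supernova versions may emit additional or different fields.

```python
def parse_header(line):
    """Parse a FASTA header line like the examples above into a dict."""
    fields = line.lstrip(">").split()
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    edges = record.get("edges", "")
    if ".." in edges:
        # Short form: only the first and last edge ids are given.
        first, last = edges.split("..")
        record["edges"] = {"first": int(first), "last": int(last), "path": None}
    elif edges:
        # Full form: the complete comma-separated path of edge ids.
        path = [int(e) for e in edges.split(",")]
        record["edges"] = {"first": path[0], "last": path[-1], "path": path}
    return record
```

Values other than edges are left as strings, so a style check reads parse_header(line)["style"] == "1".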


Frequently asked questions about output format

I can't find some known heterozygous sites in the megabubble FASTA. How is that possible?

In the non-raw output formats, if a region has low enough heterozygosity that it cannot be phased, then the region is presented as a single FASTA sequence (rather than as two megabubble arms), and bubbles in the region are squashed, in effect choosing one allele at random at heterozygous loci. Both alleles are still present in the raw FASTA output. In addition, there are currently cases where small megabubbles are similarly collapsed in the FASTA.

How are cycles handled in the FASTA output files?

In the raw output format, cycles are shown directly in the graph, as single 'loop' edges or more complicated cyclic structures consisting of multiple edges.

Internally, Supernova captures cyclic structures within longer scaffolds as 'cycle gap edges', which are single edges encoding a more complex graph structure. The raw output format expands these cycle gap edges, which is why the cycles appear there directly. In the non-raw output formats, starting with Supernova version 2.0, each cycle gap edge is broken up into separate FASTA records, each representing a scaffold, and the gap edge itself is replaced by 10 Ns.