10x Genomics
Chromium De Novo Assembly

Generating Output

Once your assembly has completed, its results are stored as binary data structures. Use the supernova mkoutput command to generate FASTA output representing your assembly. Full details follow the usage example.

supernova mkoutput \
        --style=raw|megabubbles|pseudohap|pseudohap2 \
        --asmdir=/path/to/outs/assembly \
        --outprefix=output_filename_prefix \
        [ --minsize=N ] \
        [ --headers=short|full ]
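When mkoutput is called from a script, it can help to assemble and validate the arguments before running anything. The following is a minimal sketch that only builds the argument list from the options shown in the usage above; the helper name build_mkoutput_cmd is hypothetical and not part of Supernova.

```python
import shlex

VALID_STYLES = ("raw", "megabubbles", "pseudohap", "pseudohap2")
VALID_HEADERS = ("short", "full")


def build_mkoutput_cmd(style, asmdir, outprefix, minsize=None, headers=None):
    """Return the argv list for a `supernova mkoutput` call (hypothetical helper)."""
    if style not in VALID_STYLES:
        raise ValueError(f"unknown style: {style!r}")
    if headers is not None and headers not in VALID_HEADERS:
        raise ValueError(f"unknown headers mode: {headers!r}")
    cmd = [
        "supernova", "mkoutput",
        f"--style={style}",
        f"--asmdir={asmdir}",
        f"--outprefix={outprefix}",
    ]
    # Optional parameters are appended only when the caller sets them.
    if minsize is not None:
        cmd.append(f"--minsize={int(minsize)}")
    if headers is not None:
        cmd.append(f"--headers={headers}")
    return cmd


cmd = build_mkoutput_cmd("pseudohap2", "/path/to/outs/assembly", "/x/y/z", minsize=1000)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where supernova is on the PATH.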

Required Style Option

There are four styles of FASTA output:

--style=raw

The raw style, identified in FASTA records as style=1, represents every edge in the assembly as a FASTA record (seen as red segments in the above illustration). These include microbubble arms and gaps. Gaps are represented by an edge consisting of a sequence of Ns roughly approximating the size of the gap. Gaps captured by read pairs are represented by 100 Ns; other gaps are longer. In addition, where cycles are present in the graph, an arbitrary path is chosen through the cycle, and the sequence for that path is suffixed by 10 Ns.

Bubbles and gaps generally appear once per 10-20 kb. Raw graph records are roughly two orders of magnitude shorter than megabubble arms. For each edge in the raw graph, there is also an edge written to the FASTA file representing the reverse complement sequence.
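The N conventions described for the raw style can be used to label records after the fact. The sketch below is illustrative only: the function name and labels are hypothetical, and a record that merely happens to end in 10 Ns would also be reported as a possible cycle path.

```python
def classify_raw_record(seq):
    """Label a raw-style record by its N content (labels are hypothetical)."""
    seq = seq.upper()
    if seq and set(seq) == {"N"}:
        # Gap edge: 100 Ns for gaps captured by read pairs, longer otherwise.
        if len(seq) == 100:
            return "read_pair_gap"
        if len(seq) > 100:
            return "other_gap"
        return "short_n_run"  # not described in the text; flagged separately
    if seq.endswith("N" * 10):
        # A path chosen through a cycle is suffixed by 10 Ns.
        return "possible_cycle_path"
    return "sequence"
```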

This is the most detailed of the output styles, in that it is the only one to include reverse complement edges and not flatten bubbles. It is also the only one in which gaps remain as their own distinct edges.

--style=megabubbles

In this style, identified in FASTA records as style=2, each megabubble arm corresponds to a FASTA record, as does each intervening sequence.

Microbubbles are flattened by selecting the branch having the highest coverage. Gaps are joined to adjacent sequences, resulting in longer edges that represent the gaps internally by sequences of Ns. Reverse complement edges are not represented.

--style=pseudohap

The pseudohap style, identified in FASTA records as style=3, generates a single record per scaffold. For example, the seven red edges on top of the illustration for --style=megabubbles, corresponding to seven FASTA records, are combined into a single FASTA record. Megabubble arms are chosen arbitrarily, so many records will mix maternal and paternal alleles.

As in the megabubble graph, microbubbles are flattened, gaps are joined to adjacent sequences, and reverse complement edges are not represented.

--style=pseudohap2

This style, identified in FASTA records as style=4, is like the pseudohap option, except that for each scaffold, two ‘parallel’ pseudohaplotypes are created and placed in separate FASTA files. Records in these files are parallel to each other.
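Because the two pseudohap2 files are parallel, they can be walked together record by record. The sketch below assumes that "parallel" means the same record numbers appear in the same order in both files; the function names, the abbreviated headers, and the tiny sequences are made up for illustration.

```python
import io
from itertools import zip_longest


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def paired_records(handle1, handle2):
    """Yield (record_id, seq1, seq2), assuming both files list records in the same order."""
    for rec1, rec2 in zip_longest(read_fasta(handle1), read_fasta(handle2)):
        if rec1 is None or rec2 is None:
            raise ValueError("files have different numbers of records")
        id1, id2 = rec1[0].split()[0], rec2[0].split()[0]
        if id1 != id2:
            raise ValueError(f"record mismatch: {id1} vs {id2}")
        yield id1, rec1[1], rec2[1]


# In-memory stand-ins for the two pseudohap2 FASTA files (made-up sequences).
hap1 = io.StringIO(">1 style=4\nACGTA\n>2 style=4\nGGCC\n")
hap2 = io.StringIO(">1 style=4\nACCTA\n>2 style=4\nGGCC\n")
pairs = list(paired_records(hap1, hap2))
```

With real output, the two StringIO objects would be replaced by open file handles for the two FASTA files.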

Required Non-Style Options

--asmdir=PATH/TO/outs/assembly

This should be set to the path of the assembly output directory created by Supernova. This is the outs/assembly directory underneath your pipeline output directory.

--outprefix=OUTPUT_FILE_PREFIX

This is a filename prefix for assembly output. It can include a relative or absolute path. For instance, specifying --outprefix=/x/y/z will create a FASTA file in the directory /x/y called z.fasta. Note that for the pseudohap2 option, which creates two files, these would be called z.1.fasta and z.2.fasta.
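The naming rule above can be written down as a small helper, for example to check that the expected files exist after a run. This is a sketch; the function name is hypothetical.

```python
def expected_fasta_paths(outprefix, style):
    """Return the FASTA paths implied by --outprefix for a given --style."""
    if style == "pseudohap2":
        # pseudohap2 writes one file per pseudohaplotype.
        return [f"{outprefix}.1.fasta", f"{outprefix}.2.fasta"]
    return [f"{outprefix}.fasta"]
```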

Optional Parameters

--minsize=N

For output styles other than raw, you may choose to print only those FASTA records longer than a given size. In raw mode all FASTA records are printed.
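The same kind of filter can be applied afterwards to records that have already been written, for example to try a stricter cutoff without rerunning mkoutput. The sketch below works on (header, sequence) pairs and makes two assumptions that the text does not settle: a record exactly at the cutoff is kept, and Ns count toward the record length. The function name and the sample records are made up.

```python
def filter_by_minsize(records, minsize=1000):
    """Yield (header, sequence) pairs with at least `minsize` bases, Ns included."""
    for header, seq in records:
        # Assumption: a record exactly `minsize` long is kept.
        if len(seq) >= minsize:
            yield header, seq


records = [("1", "A" * 1500), ("2", "C" * 200), ("3", "G" * 1000)]
kept = [header for header, _ in filter_by_minsize(records)]
```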

--minsize=N [specify minimum FASTA record size in bases, default: 1000]

--index

For the pseudohap2 output style only, this option causes an index file (suffixed .idx instead of .fasta) to be written out for each of the two pseudohap files. The index files give the coordinates in bases, for each record, of each transition between megabubble and non-megabubble (nominally homozygous) sequences. Consider the image below:

A record in the pseudohap index file corresponding to the top path in the graph might look like this:

359 0 1665 3604556 3622153 9159962 9813322 10398144 10429438 

The first number on the line is the record number as seen in the corresponding pseudohap FASTA file. There are eight numbers remaining on the line corresponding to the eight transitions in the image (numbered 0 to 7).

The first coordinate is always zero and the last coordinate is always the length of the entire record in bases, with N's included.

From this line we see that the sequence between transitions 0 and 1 is 1665 bases long, and that the top arm of the first megabubble, between transitions 1 and 2, starts at base 1665 of the record and is 3604556 - 1665 = 3,602,891 bases long.

The second index file corresponding to the second pseudo-haplotype, and the bottom path through the graph, will have a record 359 with the same number of entries.

A scaffold can have any number of megabubbles, even zero. A scaffold with zero megabubbles is entirely unphased and appears identically in both pseudohap files. Its index file lines will each contain three numbers: the record number, 0, and the record length.
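The index line format described above can be read with a few lines of code. The sketch below parses the example line and derives the length of each interval between transitions. It assumes that intervals alternate between non-megabubble and megabubble sequence, starting with non-megabubble sequence, which matches the example but is not stated as a general rule. The function names are hypothetical.

```python
def parse_index_line(line):
    """Split one .idx line into (record_number, coordinates)."""
    fields = [int(tok) for tok in line.split()]
    if len(fields) < 3:
        raise ValueError("expected a record number and at least two coordinates")
    return fields[0], fields[1:]


def segments(coords):
    """Return (start, length, is_megabubble) for each interval between transitions.

    Assumes intervals alternate, starting with non-megabubble sequence.
    """
    out = []
    for i in range(len(coords) - 1):
        start, end = coords[i], coords[i + 1]
        out.append((start, end - start, i % 2 == 1))
    return out


record, coords = parse_index_line(
    "359 0 1665 3604556 3622153 9159962 9813322 10398144 10429438"
)
segs = segments(coords)
megabubble_lengths = [length for _, length, is_mb in segs if is_mb]
print(record, segs[0], segs[1])
# 359 (0, 1665, False) (1665, 3602891, True)
```

An unphased scaffold yields a single non-megabubble interval covering the whole record.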

Note that the index file can be used to calculate the phase block N50 (see metrics).
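As a rough illustration of that calculation, the sketch below computes an N50 over megabubble arm lengths taken from index lines. It assumes that each megabubble arm counts as one phase block and that odd-numbered intervals are the megabubbles; the exact definition used by Supernova is given on the metrics page, not here, so treat the result as an approximation. The function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # N50: the length at which the running sum reaches half the total.
        if 2 * running >= total:
            return length
    return 0


def collect_megabubble_lengths(index_lines):
    """Collect megabubble arm lengths, assuming odd-numbered intervals are megabubbles."""
    lengths = []
    for line in index_lines:
        coords = [int(tok) for tok in line.split()][1:]  # drop the record number
        for i in range(1, len(coords) - 1, 2):
            lengths.append(coords[i + 1] - coords[i])
    return lengths


lines = ["359 0 1665 3604556 3622153 9159962 9813322 10398144 10429438"]
block_n50 = n50(collect_megabubble_lengths(lines))
```

In practice the lines would come from one of the .idx files rather than a literal list.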

--headers=<mode>

By default, FASTA header lines show only the start and end edges of the path associated with the record. Optionally, the entire path associated with each record may be displayed. This will yield some very long header lines, which may break other software.

Mode             Behavior
--headers=full   Verbose output; all edge ids are written.
--headers=short  Only first and last edge ids shown; this is the default.

A --headers=short record might look like this:

>55 edges=4..6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT

The same record written with --headers=full might have the following header line:

>55 edges=4,15,33,7,6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT
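The two header forms shown above can be parsed with the same code. This is a minimal sketch based only on these two example lines: the function name and the edges_complete key are hypothetical, and fields other than the record number and edges are kept as strings because their meaning is not described here.

```python
def parse_header(line):
    """Parse a mkoutput FASTA header into a dict of its fields."""
    tokens = line.lstrip(">").split()
    fields = {"record": int(tokens[0])}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    edges = fields.get("edges", "")
    if ".." in edges:
        # short mode: only the first and last edge ids are given
        first, last = edges.split("..")
        fields["edges"] = [int(first), int(last)]
        fields["edges_complete"] = False
    elif edges:
        # full mode: every edge id on the path, comma-separated
        fields["edges"] = [int(e) for e in edges.split(",")]
        fields["edges_complete"] = True
    return fields


short = parse_header(">55 edges=4..6 left=15 right=88 ver=1.7 style=1")
full = parse_header(">55 edges=4,15,33,7,6 left=15 right=88 ver=1.7 style=1")
```

Both calls return record 55; the short form yields edges [4, 6] and the full form yields [4, 15, 33, 7, 6].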


Frequently asked questions about output format

I can't find some known heterozygous sites in the megabubble FASTA. How is that possible?

In the non-raw output formats, if a region has low enough heterozygosity that it cannot be phased, then the region is presented as a single FASTA sequence (rather than as two megabubble arms), and bubbles in the region are squashed, in effect choosing one allele at random at heterozygous loci. Both alleles are still present in the raw FASTA output. In addition, there are currently cases where small megabubbles are similarly collapsed in the FASTA.

How are cycles handled in the FASTA output files?

In the raw output format, cycles are shown directly in the graph, as single 'loop' edges or more complicated cyclic structures.

Internally, Supernova captures cyclic structures within longer scaffolds as 'cycle gap edges', which are single edges encoding a more complex graph structure. In the raw graph output format, these cycle gap edges are expanded out, and thus seen either as single 'loop' edges or more complicated cyclic structures consisting of multiple edges. In the non-raw output formats, starting with Supernova version 2.0, each cycle gap edge is broken up into separate FASTA records, each representing a scaffold, and the gap edge itself is replaced by 10 Ns.