Supernova2.0, printed on 12/21/2024
Once your assembly has completed (yielding binary data structures), use the command supernova mkoutput
to generate a FASTA file representing your assembly. Read about the full details below the usage example.
supernova mkoutput \
    --style=raw|megabubbles|pseudohap|pseudohap2 \
    --asmdir=/path/to/outs/assembly \
    --outprefix=output_filename_prefix \
    [ --minsize=N ] \
    [ --headers=short|full ]
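When the command is driven from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the options listed in the usage line above; the helper name `build_mkoutput_command` is hypothetical and not part of Supernova, and the sketch only builds the command, it does not run it.

```python
def build_mkoutput_command(style, asmdir, outprefix, minsize=None, headers=None):
    """Return an argument list for `supernova mkoutput` (hypothetical helper)."""
    allowed_styles = {"raw", "megabubbles", "pseudohap", "pseudohap2"}
    if style not in allowed_styles:
        raise ValueError(f"unknown style: {style!r}")
    if headers is not None and headers not in {"short", "full"}:
        raise ValueError(f"unknown headers mode: {headers!r}")

    cmd = [
        "supernova", "mkoutput",
        f"--style={style}",
        f"--asmdir={asmdir}",
        f"--outprefix={outprefix}",
    ]
    # Optional arguments are appended only when the caller sets them.
    if minsize is not None:
        cmd.append(f"--minsize={int(minsize)}")
    if headers is not None:
        cmd.append(f"--headers={headers}")
    return cmd


cmd = build_mkoutput_command(
    "pseudohap", "/path/to/outs/assembly", "output_filename_prefix", minsize=1000
)
print(" ".join(cmd))
```

On a machine where Supernova is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.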
There are four styles of FASTA output:
The raw style, identified in FASTA records as style=1, represents every edge in the assembly as a FASTA record (seen as red segments in the above illustration). These include microbubble arms and gaps. Gaps are represented by a record consisting of a sequence of Ns roughly approximating the size of the gap: Gaps captured by read pairs are represented by 100 Ns; other gaps are longer. Bubbles and gaps generally appear once per 10-20 kb and the raw graph records are roughly two orders of magnitude shorter than megabubble arms. The raw graph contains both forward and reverse complement edges, and both are written as separate records to the FASTA file.
This is the most detailed of the output styles, in that it is the only one that includes reverse complement edges, does not flatten bubbles, and does not give special treatment to cycles. It is also the only style in which gaps remain as their own distinct records.
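Because raw-style gaps are standalone records made up entirely of Ns, they can be told apart from sequence edges with a simple scan. The sketch below assumes a plain-text FASTA file; `read_fasta` and `is_gap_record` are illustrative helpers, and the demo headers are made up for the example rather than real Supernova output.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_gap_record(sequence):
    """True when a record consists only of Ns, as gap records do in raw style."""
    return len(sequence) > 0 and set(sequence.upper()) == {"N"}


# Made-up demo input: one 100-N gap record between two sequence records.
demo = io.StringIO(
    ">1 style=1\nACGT\n>2 style=1\n" + "N" * 100 + "\n>3 style=1\nGGCC\n"
)
gaps = [(header, len(seq)) for header, seq in read_fasta(demo) if is_gap_record(seq)]
print(gaps)
```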
In the megabubbles style, identified in FASTA records as style=2, each megabubble arm corresponds to a FASTA record, as does each intervening sequence.
Bubbles are flattened by selecting the branch having highest coverage. Gaps are joined to adjacent sequences, resulting in longer edges that represent the gaps internally by sequences of Ns. Reverse complement edges are not represented.
The pseudohap style, identified in FASTA records as style=3, generates a single record per scaffold. For example, the seven red edges on top of the illustration for --style=megabubbles, corresponding to seven FASTA records, are combined into a single FASTA record. Megabubble arms are chosen arbitrarily, so many records will mix maternal and paternal alleles.
As in the megabubble graph, microbubbles are flattened, gaps are joined to adjacent sequences, and reverse complement edges are not represented.
The pseudohap2 style, identified in FASTA records as style=4, is like the pseudohap option, except that for each scaffold, two ‘parallel’ pseudohaplotypes are created and placed in separate FASTA files. Records in these files are parallel to each other.
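Since the two pseudohap2 files hold parallel records, they can be walked side by side. The following sketch assumes that "parallel" means the records appear in the same order in both files; that reading, and the helper names, are assumptions rather than documented behavior. The demo sequences are invented.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def paired_records(handle_1, handle_2):
    """Yield (record_1, record_2) pairs, assuming both files share record order."""
    yield from zip(read_fasta(handle_1), read_fasta(handle_2))


# Invented stand-ins for the two pseudohaplotype files.
hap1 = io.StringIO(">1\nACGTACGT\n>2\nTTTT\n")
hap2 = io.StringIO(">1\nACGAACGT\n>2\nTTTT\n")

# Flag scaffolds whose two pseudohaplotypes differ.
differs = [seq1 != seq2 for (_, seq1), (_, seq2) in paired_records(hap1, hap2)]
print(differs)
```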
The --asmdir option should be set to the path of the assembly output directory created by Supernova. This will be the directory outs/assembly underneath where your pipeline is stored.
The --outprefix option is a filename prefix for assembly output. This can include a relative or absolute path. For instance, specifying --outprefix=/x/y/z will create a FASTA file in the directory /x/y called z.fasta. Note that for the pseudohap2 option, which creates two files, these would be called z.1.fasta and z.2.fasta.
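The naming rule above can be written down as a small helper, for example to check for the expected files after a run. This is a sketch based only on the names given in the text (z.fasta, z.1.fasta, z.2.fasta); the function `expected_fasta_paths` is hypothetical.

```python
import posixpath


def expected_fasta_paths(outprefix, style):
    """Return the FASTA path(s) the text describes for a given prefix and style."""
    if style == "pseudohap2":
        # Two parallel files, one per pseudohaplotype.
        return [f"{outprefix}.1.fasta", f"{outprefix}.2.fasta"]
    return [f"{outprefix}.fasta"]


paths = expected_fasta_paths("/x/y/z", "pseudohap2")
print(paths)
# The directory and file name split as in the example from the text.
print(posixpath.dirname(paths[0]), posixpath.basename(paths[0]))
```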
For output styles other than raw, you may choose to print only those FASTA records longer than a given size. In raw mode all FASTA records are printed.
--minsize=n [specify minimum FASTA record size in bases, default: 1000]
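The size rule can be summarized as a predicate. The sketch below is only a model of the behavior described here, not Supernova's code: it assumes that a record exactly at the minimum size is kept (the text does not state the boundary case), and the function name is hypothetical.

```python
def would_be_printed(length, style, minsize=1000):
    """Model of the size filter: raw prints everything, other styles filter."""
    if style == "raw":
        return True
    # Assumption: records of exactly `minsize` bases are kept.
    return length >= minsize


print(would_be_printed(500, "raw"))        # raw mode ignores the size filter
print(would_be_printed(500, "pseudohap"))  # below the default of 1000 bases
```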
By default, FASTA header lines show only the start and end edges of the path associated with the record. Optionally, the entire path associated with each edge may be displayed. This will yield some huge header lines, which may break other software.
Mode | Behavior |
---|---|
--headers=full | verbose output; all edge ids are written. |
--headers=short | only first and last edge ids shown; this is the default. |
A --headers=short record might look like this:
>55 edges=4..6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT
The same record with --headers=full might have the following header line:
>55 edges=4,15,33,7,6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT
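Downstream tools often need to read these header lines. The sketch below parses both forms shown above, treating the first token as the record id and the rest as key=value fields. It assumes headers always follow that layout; the meaning of fields such as left and right is not interpreted here, and `parse_header` is an illustrative name.

```python
def parse_header(line):
    """Parse a header line into a dict; edges becomes first/last/path."""
    fields = line.lstrip(">").split()
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value

    edges = record.get("edges", "")
    if ".." in edges:
        # Short form: only the first and last edge ids are present.
        first, last = edges.split("..")
        record["edges"] = {"first": int(first), "last": int(last), "path": None}
    elif edges:
        # Full form: every edge id on the path, comma separated.
        path = [int(edge) for edge in edges.split(",")]
        record["edges"] = {"first": path[0], "last": path[-1], "path": path}
    return record


short = parse_header(">55 edges=4..6 left=15 right=88 ver=1.7 style=1")
full = parse_header(">55 edges=4,15,33,7,6 left=15 right=88 ver=1.7 style=1")
print(short["edges"])
print(full["edges"]["path"])
```

Both forms yield the same first and last edge ids, so code that only needs the endpoints can ignore which header mode was used.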
In the raw output format, cycles are shown directly in the graph, as single 'loop' edges or more complicated cyclic structures.
Internally, Supernova captures cyclic structures within longer scaffolds as 'cycle gap edges', which are single edges encoding a more complex graph structure. In the raw graph output format, these cycle gap edges are expanded out, and thus seen either as single 'loop' edges or more complicated cyclic structures consisting of multiple edges. In the non-raw output formats, starting with Supernova version 2.0, each cycle gap edge is broken up into separate FASTA records, each representing a scaffold, and the gap edge itself is replaced by 10 Ns.
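Where gaps are embedded in records as runs of Ns, locating those runs is a common first step. The following sketch reports the position and length of each N run in a sequence; the test sequence is invented. Run length is at most a hint about the kind of gap (the text mentions 10 Ns for cycle gap edges and 100 Ns for gaps captured by read pairs), so the sketch does not try to classify them.

```python
import re


def n_runs(sequence):
    """Return (start, length) for each maximal run of Ns in a sequence."""
    return [
        (match.start(), match.end() - match.start())
        for match in re.finditer(r"N+", sequence.upper())
    ]


# Invented sequence with a 10-N run and a 100-N run.
seq = "ACGT" + "N" * 10 + "GGCC" + "N" * 100 + "TTAA"
runs = n_runs(seq)
print(runs)
```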