Supernova2.1, printed on 11/21/2024
Analysis software for 10x Genomics linked read products is no longer supported. Raw data processing pipelines and visualization tools are available for download and can be used for analyzing legacy data from 10x Genomics kits in accordance with our end user licensing agreement without support.
Once your assembly has completed (yielding binary data structures), use the command supernova mkoutput to generate a FASTA file representing your assembly. Read about the full details below the usage example.
supernova mkoutput \
    --style=raw|megabubbles|pseudohap|pseudohap2 \
    --asmdir=/path/to/outs/assembly \
    --outprefix=output_filename_prefix \
    [ --minsize=N ] \
    [ --headers=short|full ]
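When exporting several styles from one assembly, it can be convenient to build the argument list in a script. The sketch below only constructs the command and does not run it. It uses only the options shown in the usage line above; the helper name `build_mkoutput_cmd` is hypothetical and not part of Supernova.

```python
STYLES = ("raw", "megabubbles", "pseudohap", "pseudohap2")
HEADERS = ("short", "full")


def build_mkoutput_cmd(style, asmdir, outprefix, minsize=None, headers=None):
    """Return an argv list for `supernova mkoutput` (hypothetical helper)."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    if headers is not None and headers not in HEADERS:
        raise ValueError(f"unknown headers mode: {headers!r}")
    cmd = [
        "supernova", "mkoutput",
        f"--style={style}",
        f"--asmdir={asmdir}",
        f"--outprefix={outprefix}",
    ]
    # Optional flags are appended only when explicitly requested.
    if minsize is not None:
        cmd.append(f"--minsize={int(minsize)}")
    if headers is not None:
        cmd.append(f"--headers={headers}")
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(" ".join(build_mkoutput_cmd("pseudohap", "/path/to/outs/assembly", "z")))
```

The returned list could be passed to `subprocess.run` on a machine where Supernova is installed.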
There are four styles of FASTA output:
The raw style, identified in FASTA records as style=1, represents every edge in the assembly as a FASTA record (seen as red segments in the above illustration). These include microbubble arms and gaps. Gaps are represented by an edge consisting of a sequence of Ns roughly approximating the size of the gap: gaps captured by read pairs are represented by 100 Ns; other gaps are longer. In addition, where cycles are present in the graph, an arbitrary path is chosen through the cycle, and the sequence for that path is suffixed by 10 Ns. Bubbles and gaps generally appear once per 10-20 kb. Raw graph records are roughly two orders of magnitude shorter than megabubble arms. For each edge in the raw graph, there is also an edge written to the FASTA file representing the reverse complement sequence.
This is the most detailed of the output styles, in that it is the only one to include reverse complement edges and not flatten bubbles. It is also the only one in which gaps remain as their own distinct edges.
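Because the raw style writes every edge in both orientations, downstream scripts may need to recognize that two records carry the same sequence. A minimal sketch of reverse complementation, assuming sequences contain only A, C, G, T, and N (the function name is illustrative):

```python
# Translation table mapping each base to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_reverse_complement_pair(seq_a, seq_b):
    """True if seq_b is the reverse complement of seq_a (case-insensitive)."""
    return reverse_complement(seq_a).upper() == seq_b.upper()
```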
In the megabubbles style, identified in FASTA records as style=2, each megabubble arm corresponds to a FASTA record, as does each intervening sequence.
Microbubbles are flattened by selecting the branch having highest coverage. Gaps are joined to adjacent sequences, resulting in longer edges that represent the gaps internally by sequences of Ns. Reverse complement edges are not represented.
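Since gaps in this style are embedded inside records as runs of Ns, it can be useful to locate them. The following is a small sketch, not a Supernova tool: it reports each run of N as a half-open (start, end) interval, with an optional minimum length (for example 100, the length the text above gives for gaps captured by read pairs).

```python
import re


def gap_runs(seq, min_len=1):
    """Return half-open (start, end) intervals of N runs at least min_len long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]
```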
The pseudohap style, identified in FASTA records as style=3, generates a single record per scaffold. For example, the seven red edges on top of the illustration for --style=megabubbles, corresponding to seven FASTA records, are combined into a single FASTA record. Megabubble arms are chosen arbitrarily, so many records will mix maternal and paternal alleles.
As in the megabubble graph, microbubbles are flattened, gaps are joined to adjacent sequences, and reverse complement edges are not represented.
The pseudohap2 style, identified in FASTA records as style=4, is like the pseudohap option, except that for each scaffold, two ‘parallel’ pseudohaplotypes are created and placed in separate FASTA files. Records in these files are parallel to each other.
--asmdir should be set to the path of the assembly output directory created by Supernova. This is the directory outs/assembly under the location where your pipeline output is stored.
--outprefix is a filename prefix for assembly output. It can include a relative or absolute path. For instance, specifying --outprefix=/x/y/z will create a FASTA file in the directory /x/y called z.fasta. Note that for the pseudohap2 option, which creates two files, these would be called z.1.fasta and z.2.fasta.
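The naming rule can be written down as a short sketch. It follows only the description above (one file for most styles, two for pseudohap2); the function name `expected_outputs` is hypothetical, and the actual files on disk should be checked rather than assumed.

```python
def expected_outputs(outprefix, style):
    """Return the FASTA paths implied by --outprefix for a given --style."""
    if style == "pseudohap2":
        # Two parallel pseudohaplotype files.
        return [f"{outprefix}.1.fasta", f"{outprefix}.2.fasta"]
    return [f"{outprefix}.fasta"]
```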
For output styles other than raw, you may choose to print only those FASTA records longer than a given size. In raw mode all FASTA records are printed.
--minsize=n [specify minimum FASTA record size in bases, default: 1000]
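A similar size filter can be applied after the fact to an existing FASTA file, for example to raw output, where all records are printed. The sketch below is a plain FASTA reader and filter, not Supernova code. It assumes a record is kept when its length is at least the threshold; whether Supernova treats the boundary as inclusive is not stated here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_size(records, min_size=1000):
    """Keep records whose sequence length is at least min_size (assumed inclusive)."""
    return [(h, s) for h, s in records if len(s) >= min_size]
```

To filter a file, pass an open file handle to `read_fasta` and the resulting records to `filter_by_size`.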
For the pseudohap2 output style only, this option causes an index file (suffixed .idx instead of .fasta) to be written out for each of the two pseudohap files. The index files give the coordinates in bases, for each record, of each transition between megabubble and non-megabubble (nominally homozygous) sequences. Consider the image below:
A record in the pseudohap index file corresponding to the top path in the graph might look like this:
359 0 1665 3604556 3622153 9159962 9813322 10398144 10429438
The first number on the line is the record number as seen in the corresponding pseudohap FASTA file. There are eight numbers remaining on the line corresponding to the eight transitions in the image (numbered 0 to 7).
The first coordinate is always zero and the last coordinate is always the length of the entire record in bases, with N's included.
So from the file we see that the sequence between transitions 0 and 1 is 1665 bases long, and that the top arm of the first megabubble, between transitions 1 and 2, starts at base 1665 of the record and is 3604556 - 1665 = 3,602,891 bases long.
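The arithmetic above generalizes to every segment of a record. A minimal sketch, assuming whitespace-separated integers as in the example line (the function names are illustrative):

```python
def parse_index_line(line):
    """Split an index line into (record_number, list_of_transition_coordinates)."""
    fields = [int(x) for x in line.split()]
    if len(fields) < 3:
        raise ValueError("expected a record number and at least two coordinates")
    return fields[0], fields[1:]


def segment_lengths(coords):
    """Length in bases of each segment between consecutive transitions."""
    return [b - a for a, b in zip(coords, coords[1:])]


record, coords = parse_index_line(
    "359 0 1665 3604556 3622153 9159962 9813322 10398144 10429438"
)
lengths = segment_lengths(coords)
print(record, lengths[0], lengths[1])  # prints: 359 1665 3602891
```

The segment lengths sum to the last coordinate, which is the full record length with Ns included.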
The second index file corresponding to the second pseudo-haplotype, and the bottom path through the graph, will have a record 359 with the same number of entries.
A scaffold can have any number of megabubbles, including zero. A scaffold with no megabubbles is entirely unphased and appears identically in both pseudohap files. Its index file lines each contain three numbers: the record number, 0, and the record length.
Note that the index file can be used to calculate the phase block N50 (see metrics).
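One way such a calculation might look is sketched below. It rests on two assumptions that are not confirmed here: that segments alternate starting with a non-megabubble segment (so megabubble arms are the 2nd, 4th, ... segments of each record, as in the example above), and that the metric uses the standard N50 definition (the length at which the cumulative sum of blocks, sorted longest first, reaches half the total). The definition used in the Supernova metrics may differ.

```python
def megabubble_lengths(coords):
    """Lengths of megabubble segments, assumed to be the 2nd, 4th, ... intervals."""
    lengths = [b - a for a, b in zip(coords, coords[1:])]
    return lengths[1::2]


def n50(lengths):
    """Standard N50: smallest length L such that blocks >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no blocks at all


def phase_block_n50(index_lines):
    """Compute an N50 over megabubble arm lengths from index file lines."""
    blocks = []
    for line in index_lines:
        fields = [int(x) for x in line.split()]
        if len(fields) >= 3:
            blocks.extend(megabubble_lengths(fields[1:]))
    return n50(blocks)
```

An unphased scaffold (three numbers on its index line) contributes no blocks under these assumptions.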
By default, FASTA header lines show only the start and end edges of the path associated with the record. Optionally, the entire path associated with each record may be displayed. This can yield very long header lines, which may break other software.
Mode | Behavior |
---|---|
--headers=full | verbose output; all edge ids are written. |
--headers=short | only first and last edge ids shown; this is the default. |
A --headers=short record might look like this:
>55 edges=4..6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT
And the same example shown above with --headers=full might have the following header line:
>55 edges=4,15,33,7,6 left=15 right=88 ver=1.7 style=1
ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCAACCATATATATCCCCAGAGGAGGGATTT
TTAGGACATTAGCCCACCAAATTTACACACTTATATATATTTTATCGGAGCTCCAGTCCCGCCCAAAAACTTTACGTTTT
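Both header forms can be read with the same small parser. This is a sketch based only on the two example headers above: it assumes a record id followed by space-separated key=value pairs, with edges written either as first..last (short) or as a comma-separated list (full). The meaning of the left and right fields is not described here, so they are kept as raw strings.

```python
def parse_header(header):
    """Parse a Supernova-style FASTA header into id, edge ids, and raw fields."""
    header = header.lstrip(">").strip()
    record_id, *pairs = header.split()
    fields = dict(p.split("=", 1) for p in pairs)
    edges = fields.get("edges", "")
    if ".." in edges:
        # Short form: only the first and last edge ids are present.
        first, last = edges.split("..", 1)
        edge_ids, full_path = [int(first), int(last)], False
    else:
        # Full form: every edge id on the path is listed.
        edge_ids, full_path = [int(e) for e in edges.split(",") if e], True
    return {"id": record_id, "edges": edge_ids, "full_path": full_path, "fields": fields}
```

For the short example above, the result has edges [4, 6] and full_path False; for the full example, edges is [4, 15, 33, 7, 6].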
In the raw output format, cycles are shown directly in the graph, as single 'loop' edges or more complicated cyclic structures.
Internally, Supernova captures cyclic structures within longer scaffolds as 'cycle gap edges', which are single edges encoding a more complex graph structure. In the raw graph output format, these cycle gap edges are expanded out, and thus seen either as single 'loop' edges or more complicated cyclic structures consisting of multiple edges. In the non-raw output formats, starting with Supernova version 2.0, each cycle gap edge is broken up into separate FASTA records, each representing a scaffold, and the gap edge itself is replaced by 10 Ns.