Long Ranger 2.1, printed on 10/13/2024
Versions of Long Ranger prior to 2.1 output large-scale SV calls in the BEDPE format. Starting with version 2.1, large-scale SV calls are provided in both BEDPE and VCF formats. However, all BEDPE outputs might be deprecated in future releases. Also starting with version 2.1, Long Ranger outputs mid-scale deletions (50 bp to 30 Kbp) in addition to large-scale SVs. These are provided only in the VCF format.
The VCF outputs of the Structural Variant and Copy Number Variant pipelines largely follow the VCF standard. Below we describe a few additional conventions that we adopted in order to capture information provided by the 10x data and algorithms.
SVs with both breakpoints on the same phase set are described using a single VCF record. In such cases, the type of the SV is given in the SVTYPE info field and can be one of DEL, INV, DUP, or UNK, marking respectively a deletion, inversion, tandem duplication, or event of unknown type. The type of the SV is also encoded in the ALT field (one of <DEL>, <INV>, <DUP:TANDEM>, or <UNK>). The second breakpoint (e.g. the end of a deletion or the second breakpoint of an inversion) is given by the END info field.
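The single-record convention can be read with a few lines of plain Python. This is a minimal sketch rather than a reference parser: the helper names `parse_info` and `describe_single_record_sv` are hypothetical, and the example record is made up for illustration, not taken from real Long Ranger output.

```python
def parse_info(info_column):
    """Turn a VCF INFO column into a dict; flag-style keys map to True."""
    fields = {}
    for item in info_column.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def describe_single_record_sv(vcf_line):
    """Summarise an SV whose two breakpoints are given in one VCF record."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])
    return {
        "chrom": cols[0],
        "start": int(cols[1]),      # first breakpoint (POS column)
        "end": int(info["END"]),    # second breakpoint (END info field)
        "svtype": info["SVTYPE"],   # DEL, INV, DUP, or UNK
        "alt": cols[4],             # <DEL>, <INV>, <DUP:TANDEM>, or <UNK>
    }


# Made-up record for illustration only; not real Long Ranger output.
example = "\t".join([
    "chr1", "1000000", "call_1", "N", "<DEL>", "25", "PASS",
    "SVTYPE=DEL;END=1050000", "GT", "0|1",
])
sv = describe_single_record_sv(example)
print(sv["svtype"], sv["chrom"], sv["start"], sv["end"])
```

Records described as breakends have no END info field, so a caller would need to check SVTYPE before using this helper.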
If the breakpoints are on different phase sets, each breakpoint is put in a separate VCF record (otherwise we wouldn't know which phase set the genotype field refers to). In this case, the ALT field describes the adjacency created by the breakpoint. For information about describing adjacencies using breakends, please see the VCF standard. The VCF standard specifies that the SVTYPE info field of breakends must be BND. Therefore, for SVs described as breakends, we use a custom info field, SVTYPE2, to specify the predicted type of the SV. All breakends referring to the same event have the same value of the EVENT info field. In addition, for each breakend of the event, the MATEID info field points to the other breakend of the same event.
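To reassemble an event from its breakend records, one option is to group records by EVENT and then check that the MATEID values point at each other. The sketch below assumes tab-separated VCF lines; the function names and the two example records (IDs, coordinates, genotypes) are hypothetical and only illustrate the convention.

```python
from collections import defaultdict


def parse_info(info_column):
    """Turn a VCF INFO column into a dict; flag-style keys map to True."""
    fields = {}
    for item in info_column.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def group_breakends(vcf_lines):
    """Group breakend records (SVTYPE=BND) by their EVENT info field."""
    events = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])
        if info.get("SVTYPE") != "BND":
            continue  # single-record SVs are not breakends
        events[info["EVENT"]].append({
            "id": cols[2],
            "chrom": cols[0],
            "pos": int(cols[1]),
            "alt": cols[4],                  # adjacency in breakend notation
            "mate": info.get("MATEID"),
            "svtype2": info.get("SVTYPE2"),  # predicted type of the SV
        })
    return dict(events)


def mates_are_reciprocal(breakends):
    """Check that every MATEID names a breakend that points back."""
    by_id = {b["id"]: b for b in breakends}
    return all(
        b["mate"] in by_id and by_id[b["mate"]]["mate"] == b["id"]
        for b in breakends
    )


# Made-up records for illustration only; not real Long Ranger output.
lines = [
    "\t".join(["chr1", "1000000", "call_7_1", "N", "N[chr1:2000000[", "30", "PASS",
               "SVTYPE=BND;SVTYPE2=DEL;EVENT=call_7;MATEID=call_7_2", "GT", "0|1"]),
    "\t".join(["chr1", "2000000", "call_7_2", "N", "]chr1:1000000]N", "30", "PASS",
               "SVTYPE=BND;SVTYPE2=DEL;EVENT=call_7;MATEID=call_7_1", "GT", "1|0"]),
]
events = group_breakends(lines)
for name, breakends in events.items():
    print(name, breakends[0]["svtype2"], len(breakends), mates_are_reciprocal(breakends))
```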
Inversions with the two breakpoints on different phase sets are split into four separate VCF records. This is done because each inversion breakpoint implies two sets of adjacencies. For examples, see the section on inversions in the VCF standard.
The possible filter fields in our SV VCF files are similar to the filters applied to the entries of the SV BEDPE output. A VCF entry that passes all filters has the value PASS in the filter column. All other SV entries either have low support or are in spurious regions of the genome.
Info field | Description |
---|---|
SVTYPE | Type of the SV (DEL, INV, DUP, or UNK), or BND for SVs described as sets of breakends. |
SVTYPE2 | Type of the SV, for SVs described as breakends. |
IMPRECISE_DIR | Flag indicating that the orientation of the adjacency is unknown. This only applies to SVs described using breakends. |
SVLEN | Length of the variant. |
END | Second breakpoint of the SV, for SVs given in a single record. |
EVENT | Unique name of the SV which can be used to group together breakends referring to the same event. |
MATEID | Name of the other breakend of the same event. |
CIPOS/CIEND | Uncertainty around the predicted first and second breakpoints of the event. Each is a tuple of two values specifying the region of uncertainty around the POS (or END) value. |
The remaining info fields in our SV VCFs are similar to the fields in the BEDPE output.
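As an illustration of the CIPOS/CIEND entries in the table above, the sketch below converts a two-value tuple into a coordinate interval. It assumes the values are offsets relative to POS (or END), as in the VCF standard; the function name `breakpoint_interval` and the numbers are hypothetical.

```python
def breakpoint_interval(position, ci_value):
    """Return (lowest, highest) plausible coordinate for a breakpoint.

    Assumes ci_value is "low,high" with both values given as offsets
    from position, following the VCF standard's CIPOS/CIEND definition.
    """
    low, high = (int(x) for x in ci_value.split(","))
    return position + low, position + high


# Made-up values for illustration only.
first = breakpoint_interval(1000000, "-50,50")    # POS with CIPOS
second = breakpoint_interval(1050000, "-20,100")  # END with CIEND
print(first, second)
```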
The quality scores in our Structural Variant VCFs are not Phred-scaled probabilities. Instead, the value provided in the QUAL field is an estimate of the barcode support for the corresponding event. Our SV-calling algorithms estimate the appropriate cutoff for this value based on the properties of the sample, such as depth and loaded mass. Note that this implies that the quality score values are not necessarily comparable across samples.
Our large-scale SV calling pipeline, which identifies SVs larger than 30 Kbp, outputs two BEDPE files: one with high-quality calls and another with low-quality candidates. This is done for compatibility with earlier versions of the pipeline.
However, the VCF files output by our pipelines might contain both high-quality calls and low-quality candidates. High-quality calls passing all filters have the PASS flag in the filter field.
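Because one VCF file can mix both kinds of entries, a consumer that wants only high-quality calls has to check the filter column itself. A minimal sketch, assuming tab-separated VCF lines; the function name `split_by_filter` is hypothetical, and `EXAMPLE_FILTER` is a placeholder rather than an actual Long Ranger filter name.

```python
def split_by_filter(vcf_lines):
    """Separate record IDs into PASS calls and everything else."""
    passing, candidates = [], []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        if cols[6] == "PASS":
            passing.append(cols[2])
        else:
            candidates.append(cols[2])
    return passing, candidates


# Made-up records for illustration only; EXAMPLE_FILTER is a placeholder.
lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "\t".join(["chr1", "1000000", "call_1", "N", "<DEL>", "25", "PASS",
               "SVTYPE=DEL;END=1050000"]),
    "\t".join(["chr2", "500000", "call_2", "N", "<INV>", "3", "EXAMPLE_FILTER",
               "SVTYPE=INV;END=900000"]),
]
passing, candidates = split_by_filter(lines)
print(passing, candidates)
```

Selecting on the filter column rather than on a fixed QUAL threshold fits the note above that quality scores are not necessarily comparable across samples.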