Long Ranger 2.1, printed on 11/22/2024
Long Ranger detects large-scale structural variants (SVs) based on barcode co-occurrences. In WGS samples, Long Ranger attempts to call deletions, inversions, and tandem duplications greater than 30Kbp, as well as large inter-chromosomal translocations (typically involving tens of Kbp of moved sequence). In targeted samples, the same types of events are called; however, the minimum size of the events Long Ranger attempts to call is sample-specific and is defined as the 97.5th percentile of the molecule size distribution.
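As an illustration of the targeted-sample threshold, the sketch below computes a 97.5th percentile over a list of molecule lengths. The function name `min_sv_size_from_molecules`, the nearest-rank percentile method, and the input values are assumptions made for this example; the text above does not say how Long Ranger estimates the percentile internally.

```python
import math


def min_sv_size_from_molecules(molecule_lengths, percentile=97.5):
    """Return the nearest-rank percentile of the given molecule lengths.

    Illustrative only: the percentile method used by Long Ranger itself
    is not documented here, so nearest-rank is an assumption.
    """
    if not molecule_lengths:
        raise ValueError("need at least one molecule length")
    ordered = sorted(molecule_lengths)
    # Nearest-rank: the smallest value such that at least `percentile`
    # percent of the observations are less than or equal to it.
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical molecule lengths in base pairs (1 Kbp to 100 Kbp).
lengths = [1000 * i for i in range(1, 101)]
threshold = min_sv_size_from_molecules(lengths)
print(threshold)
```

With a threshold of this kind, events shorter than the returned size would fall below the range that the large-scale caller attempts to call for that sample.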
Versions of Long Ranger prior to 2.1 output large-scale SV calls in the BEDPE format. Starting with version 2.1 of Long Ranger, large-scale SV calls are provided in both BEDPE and VCF formats. However, all BEDPE outputs may be deprecated in future releases. Also starting with version 2.1, Long Ranger outputs mid-scale deletions (50bp-30Kbp) in addition to large-scale SVs. These are only provided in the VCF format.
The BEDPE format is similar to the BED format and can be used to describe pairs of genomic regions.
The BEDPE file contains one SV per line with the following tab-delimited columns:
Column | Description |
---|---|
chrom1 | chromosome of the first breakpoint. |
start1 | start position of the first breakpoint. |
stop1 | end position of the first breakpoint. |
chrom2 | chromosome of the second breakpoint. |
start2 | start position of the second breakpoint. |
stop2 | end position of the second breakpoint. |
name | a unique string identifying the SV. |
qual | score (see below for details). |
strand1 | strand of the first breakpoint (not currently used; always '+'). |
strand2 | strand of the second breakpoint (not currently used; always '+'). |
filter | a semicolon-delimited list of filters that were applied to the SV, or a single period (.) if the SV was not filtered out. |
info | extra information about the SV or a single period (.). |
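
The twelve columns above map naturally onto a small record type. The following is a minimal sketch, not part of Long Ranger: `BedpeRecord` and `parse_bedpe_line` are hypothetical names, the example line is made up, and parsing `qual` as a float is an assumption, since its numeric type is not specified here.

```python
from typing import NamedTuple


class BedpeRecord(NamedTuple):
    """One SV call, with fields in the column order listed above."""
    chrom1: str
    start1: int
    stop1: int
    chrom2: str
    start2: int
    stop2: int
    name: str
    qual: float   # assumption: numeric score, parsed as float
    strand1: str
    strand2: str
    filter: str   # raw semicolon-delimited string, or "."
    info: str     # raw semicolon-delimited key=value string, or "."


def parse_bedpe_line(line: str) -> BedpeRecord:
    """Split one tab-delimited BEDPE line into a BedpeRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return BedpeRecord(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        fields[6], float(fields[7]),
        fields[8], fields[9], fields[10], fields[11],
    )


# Hypothetical example line; the values are invented for illustration.
example = "chr1\t1000000\t1000500\tchr1\t1100000\t1100500\tcall_1\t42\t+\t+\t.\tTYPE=DEL;ZS=HET"
record = parse_bedpe_line(example)
print(record.name, record.chrom1, record.start1, record.stop2)
```

Keeping `filter` and `info` as raw strings at this stage leaves their interpretation to separate helpers, which keeps the column parsing independent of the key sets a given pipeline emits.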
In the 2.1 version of Long Ranger, the quality score of a structural variant is an estimate of the barcode support for the event. This is true for all BEDPE outputs. However, different pipelines (e.g. large-scale SV-caller vs CNV-caller) compute this in different ways. In earlier versions of Long Ranger the quality score was a log-likelihood score.
Long Ranger defines each breakpoint as a region rather than a single position because sequencing parameters such as depth and target pull-down limit the resolution of breakpoint detection.
The filter field (column 11) is a semicolon-delimited string of filters that the SV failed to pass. Below is a list of possible filters.
Filter | Description |
---|---|
BLACK_DIST | At least one breakpoint is within 10Kb of the blacklist (see also the BLACK_DIST1 and BLACK_DIST2 info fields below). |
BLACK_FRAC | The SV has >50% of base pairs overlapping the blacklist (see also the BLACK_FRAC info field below). |
SEG_DUP | The SV breakpoints are within 20Kb of copies of the same segmental duplication. |
LOWQ | Low-confidence (low-quality) candidate. |
The SV blacklist and segmental duplication list are included in the refdata-hg19, refdata-b37 and refdata-GRCh38 reference packages supplied with Long Ranger. These lists define gaps and other ambiguous regions of the reference genome that have been found to raise spurious large-scale SV calls.
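The filter column can be turned into a list of filter names with a few lines of code. This is a minimal sketch under the conventions described above (semicolon-delimited, single period when no filter applies); `parse_filters` and `passed_all_filters` are hypothetical helper names, not Long Ranger functions.

```python
from typing import List


def parse_filters(filter_field: str) -> List[str]:
    """Return the filters an SV failed, or an empty list for '.'."""
    if filter_field in (".", ""):
        return []
    # Drop empty items so a stray trailing semicolon is tolerated.
    return [name for name in filter_field.split(";") if name]


def passed_all_filters(filter_field: str) -> bool:
    """True when the SV was not filtered out."""
    return not parse_filters(filter_field)


print(parse_filters("BLACK_DIST;SEG_DUP"))
print(passed_all_filters("."))
```

A caller that only wants unfiltered SVs can keep the records for which `passed_all_filters` returns `True`.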
The info field (column 12) is a semicolon-delimited string of key=value pairs. A single period (.) in the value indicates that the value is missing (e.g. because the corresponding info key does not apply to this entry of the BEDPE file).
Key | Description |
---|---|
BLACK1 | If the first breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
BLACK2 | If the second breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
BLACK_DIST1 | Distance between the first breakpoint and the blacklist. |
BLACK_DIST2 | Distance between the second breakpoint and the blacklist. |
BLACK_FRAC | Fraction of the SV length that overlaps the blacklist. |
NPAIRS | Number of read-pairs supporting the SV. |
NSPLIT | Number of split reads supporting the SV. |
SEG_DUP | Comma-separated list of segmental duplications that overlap the breakpoints of the SV. |
ALLELIC_FRAC | Fraction of barcodes at the SV locus that support the SV. |
FRAC_HAP_SUPPORT | Fraction of support coming from the assigned haplotype (for HET events). |
HAP_ALLELIC_FRAC | Fraction of barcodes on the assigned haplotype that support the SV. |
MATCHES | Comma-separated list of ground-truth SVs that match the BEDPE entry. Always missing (.), unless a ground-truth list of SV calls is provided to the longranger pipeline. |
TYPE | Type of SV. If the breakpoints are <500Kb apart, this will be one of DEL (deletion), INV (inversion), DUP (tandem duplication), or UNK (unknown type). All events with breakpoints >500Kb apart are marked as DISTAL. |
ORIENT | For the events of the DISTAL type, ORIENT shows how the breakpoints were rearranged. ORIENT=++ means that the region downstream of the first breakpoint is joined to the region downstream of the second breakpoint, ORIENT=+- means that the region downstream of the first breakpoint is joined to the region upstream of the second breakpoint, and so on. ORIENT is set to '..' if the orientation is unknown or the type is not DISTAL. |
RP_LR | Log-likelihood ratio score of read-pair support. Higher values correspond to stronger read-pair evidence. |
RP_TYPE | SV type suggested by the orientation of read-pairs around the breakpoints. This is similar to the TYPE/ORIENT fields but uses read-pair information instead of molecule/barcode-level information. Distal events (breakpoints >500Kb apart) are marked as TRANS_RR, TRANS_RF, TRANS_FF, and TRANS_FR, where the last two characters show the orientation of reads around each of the two breakpoints (F: forward, R: reverse). So RP_TYPE=TRANS_FR is equivalent to ORIENT=-+. However, the RP_TYPE does not have to be compatible with TYPE/ORIENT if the signal at the molecule level and read-level are not in agreement. |
PS1/PS2 | Phase sets to which each of the breakpoints was assigned. |
HAPS | A comma-separated list of two values (0, 1, or .) showing the haplotype to which each of the two breakpoints was assigned. A period (.) means that the corresponding breakpoint was unphased. For homozygous events (ZS=HOM), this field is always '.,.'. |
ZS | Inferred zygosity. This can be HOM, HET, or . for homozygous, heterozygous, or unknown zygosity respectively. Note that for a somatic sample, this refers to the haplotype of origin, so a HET event is one that originated from one haplotype only. |
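
The info column can be unpacked into a dictionary keyed by the names in the table above. The sketch below is illustrative only: `parse_info` and `parse_haps` are hypothetical helpers, the example string is invented, and mapping a lone period to `None` is a choice made for this example rather than Long Ranger behavior.

```python
from typing import Dict, Optional, Tuple


def parse_info(info_field: str) -> Dict[str, Optional[str]]:
    """Split 'KEY=value;KEY=value' into a dict; '.' values become None."""
    result: Dict[str, Optional[str]] = {}
    if info_field in (".", ""):
        return result
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A key without '=' or with a lone '.' is treated as missing.
        result[key] = None if (not sep or value == ".") else value
    return result


def parse_haps(haps_value: Optional[str]) -> Tuple[Optional[int], Optional[int]]:
    """Turn a HAPS value such as '0,.' into (0, None)."""
    if haps_value is None:
        return (None, None)
    first, second = haps_value.split(",")
    to_int = lambda text: None if text.strip() == "." else int(text)
    return (to_int(first), to_int(second))


# Hypothetical info string; the values are invented for illustration.
info = parse_info("TYPE=DEL;ZS=HET;HAPS=0,.;MATCHES=.")
print(info["TYPE"], info["MATCHES"], parse_haps(info["HAPS"]))
```

Values are kept as strings because their types differ per key (counts, fractions, lists); callers can convert individual keys such as NPAIRS or ALLELIC_FRAC as needed.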