# Phased Large-Scale Structural Variants in BEDPE Format

Long Ranger detects large-scale structural variants (SVs) based on barcode co-occurrences. In WGS samples, Long Ranger attempts to call deletions, inversions, and tandem duplications greater than 30Kbp, as well as large inter-chromosomal translocations (typically involving tens of Kbp of moved sequence). In targeted samples, the same types of events are called; however, the minimum event size that Long Ranger attempts to call is sample-specific and is defined as the 97.5th percentile of the molecule size distribution.

The BEDPE format is similar to the BED format and can be used to describe pairs of genomic regions.

The BEDPE file contains one SV per line with the following tab-delimited columns:

| Column | Description |
|---|---|
| chrom1 | Chromosome of the first breakpoint. |
| start1 | Start position of the first breakpoint. |
| stop1 | End position of the first breakpoint. |
| chrom2 | Chromosome of the second breakpoint. |
| start2 | Start position of the second breakpoint. |
| stop2 | End position of the second breakpoint. |
| name | A unique string identifying the SV. |
| qual | Score (see below for details). |
| strand1 | Strand of the first breakpoint (not currently used; always '+'). |
| strand2 | Strand of the second breakpoint (not currently used; always '+'). |
| filter | A semicolon-delimited list of filters that were applied to the SV, or a single period (.) if the SV was not filtered out. |
| info | Extra information about the SV, or a single period (.). |
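As a minimal sketch of reading this layout, the Python snippet below splits one tab-delimited line into the twelve columns listed above. The names `SVRecord` and `parse_bedpe_line` are hypothetical helpers, not part of Long Ranger, and the example record is made up for illustration. The sketch assumes the positions are integers and leaves `qual` as text.

```python
from typing import NamedTuple


class SVRecord(NamedTuple):
    """One BEDPE line, with fields in the column order described above."""
    chrom1: str
    start1: int
    stop1: int
    chrom2: str
    start2: int
    stop2: int
    name: str
    qual: str      # left as text; convert as needed for your analysis
    strand1: str
    strand2: str
    filter: str    # raw semicolon-delimited string, or "."
    info: str      # raw semicolon-delimited key=value string, or "."


def parse_bedpe_line(line: str) -> SVRecord:
    """Split a single tab-delimited BEDPE line into an SVRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    chrom1, start1, stop1, chrom2, start2, stop2 = fields[:6]
    # Assumption: breakpoint coordinates are integers.
    return SVRecord(chrom1, int(start1), int(stop1),
                    chrom2, int(start2), int(stop2), *fields[6:])


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = "chr1\t1000\t1100\tchr1\t50000\t50100\tcall_1\t25\t+\t+\t.\t."
    print(parse_bedpe_line(example))
```

A named tuple keeps the column order explicit while still allowing access by name, for example `record.chrom2`.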

## Filter Entries

The filter field (column 11) is a semicolon-delimited string of filters that the SV failed to pass. Below is a list of possible filters.

| Filter | Description |
|---|---|
| BLACK_DIST | At least one breakpoint is within 10Kb of the blacklist (see also the BLACK_DIST1 and BLACK_DIST2 info fields below). |
| BLACK_FRAC | The SV has >50% of base pairs overlapping the blacklist (see also the BLACK_FRAC info field below). |
| SEG_DUP | The SV breakpoints are within 20Kb from copies of the same segmental duplication. |
| LOWQ | Not confident/low quality candidate. |

The SV blacklist and segmental duplication list are included in the refdata-hg19, refdata-b37, and refdata-GRCh38 reference packages supplied with Long Ranger. These lists define gaps and other ambiguous regions of the reference genome that have been found to raise spurious large-scale SV calls.
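The filter field (column 11) can be split into individual filter names with a few lines of Python. This is an illustrative sketch: `parse_filters` and `is_unfiltered` are hypothetical helper names, and the only format rules assumed are the ones stated above (semicolon-delimited names, or a single period when the SV was not filtered out).

```python
from typing import List


def parse_filters(filter_field: str) -> List[str]:
    """Return the list of filters the SV failed; empty if the field is '.'."""
    if filter_field == ".":
        return []
    # Skip empty pieces in case of a stray trailing semicolon.
    return [name for name in filter_field.split(";") if name]


def is_unfiltered(filter_field: str) -> bool:
    """True when no filter was applied to the SV."""
    return not parse_filters(filter_field)


if __name__ == "__main__":
    for field in (".", "LOWQ", "BLACK_DIST;SEG_DUP"):
        print(field, parse_filters(field), is_unfiltered(field))
```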

## Info Entries

The info field (column 12) is a semicolon-delimited string of key=value pairs. A single period (.) as the value indicates that the value is missing (e.g. because the corresponding info key does not apply to this entry of the BEDPE file).

| Key | Description |
|---|---|
| BLACK1 | If the first breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
| BLACK2 | If the second breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
| BLACK_DIST1 | Distance between the first breakpoint and the blacklist. |
| BLACK_DIST2 | Distance between the second breakpoint and the blacklist. |
| BLACK_FRAC | Fraction of the SV length that overlaps the blacklist. |
| NPAIRS | Number of read-pairs supporting the SV. |
| NSPLIT | Number of split reads supporting the SV. |
| SEG_DUP | Comma-separated list of segmental duplications that overlap the breakpoints of the SV. |
| ALLELIC_FRAC | Fraction of barcodes at the SV locus that support the SV. |
| FRAC_HAP_SUPPORT | Fraction of support coming from the assigned haplotype (for HET events). |
| HAP_ALLELIC_FRAC | Fraction of barcodes on the assigned haplotype that support the SV. |
| MATCHES | Comma-separated list of ground-truth SVs that match the BEDPE entry. Always missing (.), unless a ground-truth list of SV calls is provided to the longranger pipeline. |
| TYPE | Type of SV. If the breakpoints are <500Kb apart, this will be one of DEL (deletion), INV (inversion), DUP (tandem duplication), or UNK (unknown type). All events with breakpoints >500Kb apart are marked as DISTAL. |
| ORIENT | For events of the DISTAL type, ORIENT shows how the breakpoints were rearranged. ORIENT=++ means that the region downstream of the first breakpoint is joined to the region downstream of the second breakpoint, ORIENT=+- means that the region downstream of the first breakpoint is joined to the region upstream of the second breakpoint, and so on. ORIENT is set to '..' if the orientation is unknown or the type is not DISTAL. |
| RP_LR | Log-likelihood ratio score of read-pair support. Higher values correspond to stronger read-pair evidence. |
| RP_TYPE | SV type suggested by the orientation of read-pairs around the breakpoints. This is similar to the TYPE/ORIENT fields but uses read-pair information instead of molecule/barcode-level information. Distal events (breakpoints >500Kb apart) are marked as TRANS_RR, TRANS_RF, TRANS_FF, and TRANS_FR, where the last two characters show the orientation of reads around each of the two breakpoints (F: forward, R: reverse). So RP_TYPE=TRANS_FR is equivalent to ORIENT=-+. However, RP_TYPE does not have to be compatible with TYPE/ORIENT if the signals at the molecule level and the read level are not in agreement. |
| PS1/PS2 | Phase sets to which each of the breakpoints was assigned. |
| HAPS | A comma-separated list of two values (0, 1, or .) showing the haplotype to which each of the two breakpoints was assigned. A period (.) means that the corresponding breakpoint was unphased. For homozygous events (ZS=HOM), this field is always '.,.'. |
| ZS | Inferred zygosity. This can be HOM, HET, or . for homozygous, heterozygous, or unknown zygosity respectively. Note that for a somatic sample, this refers to the haplotype of origin, so a HET event is one that originated from one haplotype only. |
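The info field can be unpacked into a dictionary along the following lines. This is a sketch under the format rules stated above. `parse_info` and `parse_haps` are hypothetical helper names, and the example string is made up. A value of a single period is mapped to `None`, while composite placeholders such as `..` for ORIENT are left as text for the caller to interpret.

```python
from typing import Dict, Optional, Tuple


def parse_info(info_field: str) -> Dict[str, Optional[str]]:
    """Turn 'KEY=value;KEY=value' into a dict; a '.' value becomes None."""
    if info_field == ".":
        return {}
    info: Dict[str, Optional[str]] = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # A key with no '=' or with a lone '.' is treated as missing.
        info[key] = None if (not sep or value == ".") else value
    return info


def parse_haps(haps: str) -> Tuple[Optional[int], Optional[int]]:
    """Split HAPS ('0,1', '0,.', '.,.') into one haplotype per breakpoint."""
    first, second = haps.split(",")
    to_hap = lambda h: None if h == "." else int(h)
    return to_hap(first), to_hap(second)


if __name__ == "__main__":
    # Made-up info string, for illustration only.
    info = parse_info("TYPE=DEL;ZS=HET;HAPS=0,.;MATCHES=.;ORIENT=..")
    print(info)
    if info.get("HAPS") is not None:
        print(parse_haps(info["HAPS"]))
```

Values are kept as strings because the keys have different types (fractions, counts, lists); converting each key separately, as `parse_haps` does for HAPS, avoids guessing a type for fields you do not use.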