Long Ranger 2.0, printed on 11/22/2024
The principal output of the longranger run pipeline includes aligned reads with barcode and phasing information in BAM format, phased SNPs and indels in VCF format, and SV calls and candidates in BEDPE format. These are all standard file formats designed to interoperate with existing tools, and the additional information produced by the Chromium Platform is included as standards-compliant fields when appropriate.
The output of the SV calling code is BEDPE, a format similar to BED that describes pairs of genomic regions. Long Ranger uses this format to describe pairs of breakpoints that define a structural variant.
The BEDPE contains one SV per line with the following tab-delimited columns:
chrom1 - chromosome of the first breakpoint
start1 - start position of the first breakpoint
stop1 - end position of the first breakpoint
chrom2 - chromosome of the second breakpoint
start2 - start position of the second breakpoint
stop2 - end position of the second breakpoint
name - a unique string identifying the SV
qual - Phred-like quality score
strand1 - strand of the first breakpoint (not currently used; always '+')
strand2 - strand of the second breakpoint (not currently used; always '+')
filter - a semicolon-delimited list of filters that were applied to the SV, or a single period (.) if the SV was not filtered out
info - extra information about the SV, or a single period (.)
Long Ranger defines each breakpoint as a region rather than a single position because sequencing parameters such as depth and target pull-down limit the resolution of breakpoint detection.
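The twelve columns above can be read into a typed record. The following is a minimal sketch, not part of Long Ranger: the names `SVRecord`, `parse_bedpe_line`, and `read_bedpe` are hypothetical, the quality score is parsed as a float on the assumption that it may be integer or decimal, and skipping blank or `#`-prefixed lines is likewise an assumption about how a file might begin. The example line uses made-up values.

```python
from typing import Iterable, Iterator, NamedTuple


class SVRecord(NamedTuple):
    """One BEDPE line, holding the 12 columns described above."""
    chrom1: str
    start1: int
    stop1: int
    chrom2: str
    start2: int
    stop2: int
    name: str
    qual: float  # assumption: float covers integer or decimal scores
    strand1: str
    strand2: str
    filter: str
    info: str


def parse_bedpe_line(line: str) -> SVRecord:
    """Split one tab-delimited line into a typed record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 tab-delimited columns, got {len(fields)}")
    return SVRecord(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        fields[6], float(fields[7]),
        fields[8], fields[9], fields[10], fields[11],
    )


def read_bedpe(lines: Iterable[str]) -> Iterator[SVRecord]:
    """Yield one record per SV line."""
    for line in lines:
        # Assumption: blank lines and '#'-prefixed lines carry no SV.
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_bedpe_line(line)


# Made-up example values, not real Long Ranger output.
example = "\t".join([
    "chr1", "1000", "1200", "chr1", "51000", "51300",
    "call_1", "25", "+", "+", ".", "TYPE=DEL",
])
record = parse_bedpe_line(example)
print(record.chrom1, record.start1, record.stop1, record.name)
```

Because each breakpoint is a region, a consumer that needs a single coordinate has to pick its own convention (for example the region midpoint); the sketch keeps both ends untouched.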
The filter field (column 11) is a semicolon-delimited string of filters that the SV failed to pass. The following filters may have been applied:
Filter | Description |
---|---|
BLACK_DIST | At least one breakpoint is within 10Kb of the blacklist (see also the BLACK_DIST1 and BLACK_DIST2 info fields below). |
BLACK_FRAC | The SV has >10% of base pairs overlapping the blacklist (see also the BLACK_FRAC info field below). |
SEG_DUP | The SV breakpoints are within 10Kb from copies of the same segmental duplication. |
NMATES | Both breakpoints of the SV participate in multiple (>5) SVs. This is an indication of low-complexity regions or barcode coalescence. |
LOW_MAPQ | Average MAPQ of reads in the call region < 40. Suggests potential alignment problems leading to a false positive call. |
DEPTH_DROP | Depth drop that is inconsistent with the presence of a deletion. Suggests alignment problems or coverage unevenness. |
HIGH_BC_COV | Barcode coverage on either breakpoint > 3 times the average barcode coverage genomewide. Suggests alignment problems leading to read pileups. |
TOO_MANY_FILTERED_BCS | More than 30% of the barcodes supporting the call have been associated with calls filtered by one or more of the other filters. |
The SV blacklist and segmental duplication list are included in the refdata-hg19 package required by Long Ranger. These lists define gaps and other ambiguous regions of the reference genome that have been found to raise spurious SV candidates and calls.
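As an illustration of how the filter field might be consumed, the sketch below splits the semicolon-delimited string and treats a single period as "no filters failed". The helper names `failed_filters` and `passes_filters` are hypothetical and not part of Long Ranger.

```python
from typing import List


def failed_filters(filter_field: str) -> List[str]:
    """Return the names of the filters the SV failed, or [] if it passed."""
    if filter_field == ".":
        return []
    # Ignore empty pieces so a stray trailing ';' does not produce a blank name.
    return [name for name in filter_field.split(";") if name]


def passes_filters(filter_field: str) -> bool:
    """True when the filter column is a single period."""
    return not failed_filters(filter_field)


print(failed_filters("LOW_MAPQ;SEG_DUP"))
print(passes_filters("."))
```

Keeping the failed filter names as a list makes it straightforward to, say, retain calls that failed only one specific filter while discarding the rest.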
The info field (column 12) is a semicolon-delimited string of key=value pairs. A single period (.) in the value indicates that the value is missing (e.g. because the corresponding info key does not apply to this entry of the BEDPE file). The following keys may be defined for a given SV:
Key | Description |
---|---|
BCOV | Number of linked-read sets supporting the SV |
BLACK1 | If the first breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
BLACK2 | If the second breakpoint of the SV is too close to a blacklist element, this will be the type of the element (e.g. centromere, gap). |
BLACK_DIST1 | Distance between the first breakpoint and the blacklist |
BLACK_DIST2 | Distance between the second breakpoint and the blacklist |
BLACK_FRAC | Fraction of the SV length that overlaps the blacklist |
FRAC_SUPPORT | Fraction of common barcodes between the two breakpoints that support the SV, weighted by their probability of belonging to the assigned haplotypes. |
HAPS | A comma-separated list of two values (0, 1, or None) showing the haplotype to which each of the two breakpoints was assigned. HAPS=None,None means that the SV was called homozygous. |
HAP_PROBS | A comma-separated list of 4 values, showing the confidence of the SV breakpoints being assigned to each of the following 4 sets of haplotypes: 00, 10, 01, 11. In other words, this is the confidence in the value of HAPS. This is only meaningful for heterozygous events (HAPS is not None,None). |
MATCHES | Comma-separated list of ground-truth SVs that match the BEDPE entry. Always missing (.), unless a ground-truth list of SV calls is provided to the longranger run pipeline. |
NBCS1 | Number of linked-read sets overlapping the first breakpoint |
NBCS2 | Number of linked-read sets overlapping the second breakpoint |
NMATES1 | Number of SVs involving the first breakpoint. A large number usually suggests a false positive. |
NMATES2 | Number of SVs involving the second breakpoint. A large number usually suggests a false positive. |
NOOV | Rough estimate of the number of linked-read sets that oppose the presence of the SV (e.g. linked-read sets from the haplotype that does not carry the SV). |
NPAIRS | Number of read-pairs supporting the SV |
NSPLIT | Number of split reads supporting the SV |
ORIENT | For events of the DISTAL type, ORIENT shows how the breakpoints were rearranged. ORIENT=++ means that the region downstream of the first breakpoint is joined to the region downstream of the second breakpoint, ORIENT=+- means that the region downstream of the first breakpoint is joined to the region upstream of the second breakpoint, and so on. ORIENT is set to '..' if the orientation is unknown or the type is not DISTAL. |
PS1/PS2 | Phase sets to which each of the breakpoints was assigned. |
RP_LR | Log-likelihood ratio score of read-pair support. Higher values correspond to stronger read-pair evidence. |
RP_TYPE | SV type suggested by the orientation of read-pairs around the breakpoints. This is similar to the TYPE/ORIENT fields but uses read-pair information instead of molecule/barcode-level information. Distal events (breakpoints >500Kb apart) are marked as TRANS_RR, TRANS_RF, TRANS_FF, and TRANS_FR, where the last two characters show the orientation of reads around each of the two breakpoints (F: forward, R: reverse). So RP_TYPE=TRANS_FR is equivalent to ORIENT=-+. However, RP_TYPE does not have to be compatible with TYPE/ORIENT if the signals at the molecule level and the read level are not in agreement. |
SEG_DUP | Comma-separated list of segmental duplications that overlap the breakpoints of the SV |
SUPPORT | Number of barcodes that are more concordant with the presence than with the absence of an SV. This is usually less than BCOV. |
TYPE | Type of SV. If the breakpoints are <500Kb apart, this will be one of: DEL (deletion), INV (inversion), DUP (tandem duplication), or UNK (unknown type). All events with breakpoints >500Kb apart are marked as DISTAL. |
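The info field can be unpacked into a dictionary along the lines described above. This is a minimal sketch rather than Long Ranger code: `parse_info` and `parse_haps` are hypothetical helper names, a value of a single period is mapped to `None`, and a key that appears without `=` is also treated as missing, which is an assumption. The ORIENT value `..` is two characters, so it is left as a string and not treated as missing. The example string uses made-up values.

```python
from typing import Dict, Optional, Tuple


def parse_info(info_field: str) -> Dict[str, Optional[str]]:
    """Split 'KEY=value;KEY=value' into a dict; '.' values become None."""
    if info_field == ".":
        return {}
    info: Dict[str, Optional[str]] = {}
    for pair in info_field.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Assumption: a key with no '=' or an empty value counts as missing.
        info[key] = None if value in ("", ".") else value
    return info


def parse_haps(value: str) -> Tuple[Optional[int], ...]:
    """Turn '0,1' into (0, 1) and 'None,None' into (None, None)."""
    return tuple(
        None if part.strip() == "None" else int(part)
        for part in value.split(",")
    )


# Made-up example values, not real Long Ranger output.
info = parse_info("TYPE=DEL;HAPS=0,1;ORIENT=..;MATCHES=.")
print(info["TYPE"], info["ORIENT"], info["MATCHES"])
print(parse_haps(info["HAPS"]))
```

Values are kept as strings because their types differ by key (counts, fractions, comma-separated lists); converting each key separately, as `parse_haps` does for HAPS, keeps the generic parser simple.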