# Phased VCF

Long Ranger reports small variant calls in a VCF file, a standard format compatible with other tools. When appropriate, additional data produced by the Chromium platform are included in standard fields. However, in some cases we add fields to report data that are not yet accounted for by the spec.

## Phasing Results

Phasing results are encoded as per section 1.4.2 of the VCF standard, using genotype fields GT (genotype), PS (phase set), and PQ (phasing quality). Long Ranger also emits two non-standard tags containing additional data. The BX tag contains per-allele barcode information, and the JQ tag contains a phasing 'junction' quality value.

### GT field

The GT (genotype) field encodes allele values separated by either / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. Examples of diploid calls are 0/1, 1|0, and 1/2. A / separator indicates an unphased genotype, and a | separator indicates a phased genotype. For phased genotypes, the allele to the left of the bar is on haplotype 1, and the allele to the right of the bar is on haplotype 2.
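As an illustration of the encoding described above, the following minimal sketch splits a GT string into allele indices and a phased flag, then maps the indices onto the REF and ALT bases. The function names `parse_gt` and `allele_bases` are hypothetical helpers written for this example and are not part of Long Ranger.

```python
def parse_gt(gt):
    """Split a VCF GT string into (allele indices, phased flag).

    A '|' separator means phased, '/' means unphased.
    A missing allele ('.') is returned as None.
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


def allele_bases(gt, ref, alt):
    """Translate GT allele indices into the bases from REF and ALT.

    `alt` is the comma-separated ALT column; index 0 is REF,
    index 1 is the first ALT allele, and so on.
    """
    alleles, phased = parse_gt(gt)
    choices = [ref] + alt.split(",")
    bases = [None if a is None else choices[a] for a in alleles]
    return bases, phased


# For a phased call, the first base is on haplotype 1 and the second on haplotype 2.
print(allele_bases("1|0", "T", "G"))
```

For an unphased call such as 0/1 the order of the returned bases carries no haplotype information.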

### PS field

PS (phase set) marks the set of variants that have been phased into a block. Variants with the same PS value are in the same phase block. Variants with different PS values are not phased with respect to one another, typically due to a lack of heterozygous SNPs or long-molecule coverage needed to extend a phase block. When evaluating phasing it is important to only consider phasing assertions within a single phase set. We use the recommended convention that the PS value is the position of the first variant in the phase set.

### Phasing Example

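| CHROM | POS | REF | ALT | GT | PS |
| --- | --- | --- | --- | --- | --- |
| chr1 | 1000 | A | C | 0\|1 | 1000 |
| chr1 | 1010 | T | G | 1\|0 | 1000 |
| chr1 | 2000 | C | T | 0\|1 | 2000 |
| chr1 | 2005 | T | G | 0/1 | 2000 |
| chr1 | 2008 | G | C | 0\|1 | 2000 |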

In this example we have two phase blocks, denoted by PS=1000 and PS=2000. PS=1000 spans positions 1000-1010, and PS=2000 spans positions 2000-2008. In PS=1000, haplotype 1 contains the REF A allele at position 1000 and the ALT G allele at position 1010, while haplotype 2 contains the ALT C allele at position 1000 and the REF T allele at position 1010.

In PS=2000, haplotype 1 contains the REF alleles at positions 2000 and 2008, while haplotype 2 contains the ALT alleles at those positions. At position 2005, we have detected a variant but have not phased it, so we don't know which allele is on which haplotype.

PS=1000 and PS=2000 are different phase blocks, so we don't know if haplotype 1 in PS=1000 corresponds to haplotype 1 or haplotype 2 in PS=2000.
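The reading of the example can be reproduced with a short sketch that groups records by phase set and assigns alleles to haplotypes only within a block. The `records` list restates the five example variants as plain tuples, and `haplotypes_by_phase_set` is a hypothetical helper, not a Long Ranger function.

```python
from collections import defaultdict

# (CHROM, POS, REF, ALT, GT, PS) for the five example variants.
records = [
    ("chr1", 1000, "A", "C", "0|1", 1000),
    ("chr1", 1010, "T", "G", "1|0", 1000),
    ("chr1", 2000, "C", "T", "0|1", 2000),
    ("chr1", 2005, "T", "G", "0/1", 2000),
    ("chr1", 2008, "G", "C", "0|1", 2000),
]


def haplotypes_by_phase_set(records):
    """Group variants by (CHROM, PS) and place alleles on haplotype 1 or 2.

    Unphased calls ('/' separator) are listed separately because their
    alleles cannot be assigned to a haplotype.
    """
    blocks = defaultdict(lambda: {"hap1": {}, "hap2": {}, "unphased": []})
    for chrom, pos, ref, alt, gt, ps in records:
        block = blocks[(chrom, ps)]
        if "|" not in gt:
            block["unphased"].append(pos)
            continue
        alleles = [ref] + alt.split(",")
        left, right = (int(a) for a in gt.split("|"))
        block["hap1"][pos] = alleles[left]
        block["hap2"][pos] = alleles[right]
    return dict(blocks)


blocks = haplotypes_by_phase_set(records)
for key, block in blocks.items():
    print(key, block)
```

Keeping each phase set in its own dictionary entry reflects the fact that haplotype labels are only comparable within a single PS value.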

### PQ field

The PQ (phasing quality) tag is a phred-scaled probability that alleles are phased incorrectly in a heterozygous call. PQ is derived from the likelihood ratio of the maximum-likelihood phasing solution and an alternate solution where the phasing of this variant is flipped.
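Because PQ is Phred-scaled, it can be converted to and from an error probability with the standard relation Q = -10 log10(p). The sketch below shows only that conversion; it does not reproduce how Long Ranger computes the underlying likelihood ratio, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality to the probability of an error."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


# A PQ of 20 corresponds to roughly a 1% chance that the variant is phased incorrectly.
print(phred_to_error_prob(20))
```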

### JQ field

The JQ (junction quality) tag is a 10x-specific addition. It contains the Phred-scaled probability that a large-scale phasing switch error occurs between this variant and the following variant. JQ is derived from the likelihood ratio of the best phasing solution and an alternate solution where all downstream variants are flipped. If flipping the downstream variants doesn't decrease the likelihood much, the JQ will be low. Phase blocks are broken at variants with JQ < 25.
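The block-breaking rule can be illustrated with a small sketch. It assumes that the break falls between a low-JQ variant and the variant that follows it, which is one reading of the description above rather than a confirmed detail of the Long Ranger implementation. The function name and the input layout are hypothetical.

```python
JQ_BREAK_THRESHOLD = 25  # threshold stated in the text above


def split_phase_blocks(variants, threshold=JQ_BREAK_THRESHOLD):
    """Split position-sorted (pos, jq) pairs into phase blocks.

    Assumption: a variant with JQ below the threshold closes the current
    block, so the next variant starts a new one.
    """
    blocks, current = [], []
    for pos, jq in variants:
        current.append(pos)
        if jq < threshold:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# The low JQ at position 200 separates the variants into two blocks.
print(split_phase_blocks([(100, 60), (200, 10), (300, 50), (400, 40)]))
```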

### BX field

BX stores the 10x barcodes supporting each allele of the variant. It is encoded as a comma-delimited string of the form:

BC_STRINGref,BC_STRINGalt1,BC_STRINGalt2,...

Each BC_STRING entry stores the barcode and base QV of each read that supported the corresponding allele. A BC_STRING is a semicolon-delimited list with one entry per barcode, and each entry is an underscore-delimited string holding the barcode followed by its base qualities:

BC1_QUAL1-1_QUAL1-2_...;BC2_QUAL2-1_QUAL2-2...

Here BC1 is the first barcode, and QUAL1-1 and QUAL1-2 are the observed Phred qualities of the bases from reads with that barcode that aligned to the variant position.

For example, a BX field that contains:

AAAA_40_38;CCCC_40,GGGG_39

encodes two BC_STRINGs: one for the reference allele (AAAA_40_38;CCCC_40) and one for the alternate allele (GGGG_39). The reference allele is supported by three reads and the alternate allele by one:

• two reference reads with the barcode AAAA (one with a Phred score of 40 and one with a Phred score of 38)
• one reference read with the barcode CCCC and a Phred score of 40
• one alternate read with the barcode GGGG and a Phred score of 39
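The BX layout described above can be unpacked with a few string splits. The following is a minimal sketch, assuming barcodes never contain an underscore, a comma, or a semicolon; `parse_bx` is a hypothetical helper and not part of Long Ranger.

```python
def parse_bx(bx):
    """Parse a BX value into one {barcode: [qualities]} dict per allele.

    The list is ordered as in the BX field: reference allele first,
    then each ALT allele in order.
    """
    per_allele = []
    for bc_string in bx.split(","):        # one BC_STRING per allele
        barcodes = {}
        for entry in bc_string.split(";"):  # one entry per barcode
            if not entry:                   # allele with no supporting reads
                continue
            barcode, *quals = entry.split("_")
            barcodes[barcode] = [int(q) for q in quals]
        per_allele.append(barcodes)
    return per_allele


ref_support, alt_support = parse_bx("AAAA_40_38;CCCC_40,GGGG_39")
print(ref_support)  # barcodes and qualities supporting the reference allele
print(alt_support)  # barcodes and qualities supporting the alternate allele
```

The number of supporting reads per allele is the total count of quality values, so the reference allele in this example has three reads and the alternate allele has one.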

## VCF Filtering

The Long Ranger pipeline uses a number of custom filters, which will show up in the FILTER field of the VCF. These use barcode and phasing information to improve the quality of variant calls. Here we give an overview of how those filters are implemented.

| FILTER | Description |
| --- | --- |
| 10X_QUAL_FILTER | A basic variant quality filter, tuned for 10x data. Heterozygous variants with QUAL < 15 and homozygous variants with QUAL < 50 will fail this filter. |
| 10X_ALLELE_FRACTION_FILTER | Filters heterozygous variants with allele fraction < 15%. |
| 10X_PHASING_INCONSISTENT | Flags heterozygous variants where the reads supporting each allele do not segregate cleanly onto the local haplotypes. The phasing algorithm compares the likelihoods of a background false-positive model and a sequencing error model to classify likely false positives. This is a powerful filter for reducing false-positive variant calls. Be aware that somatic or mosaic variants, where only a subset of the sample carries the variant, will be preferentially tagged with this filter. If you are interested in such variants, you may want to keep calls carrying this filter in your analysis. |
| 10X_HOMOPOLYMER_UNPHASED_INSERTION | A 10x-specific filter for unphased insertions in homopolymers of length >= 4. This class of variant calls is observed to be mostly false positives. |
| 10X_RESCUED_MOLECULE_HIGH_DIVERSITY | Filters variants that are supported primarily by reads that have been 'rescued' with barcode-aware alignment, where the mapped molecule has a high degree of divergence from the reference. This filter reduces false-positive variant calls in complex duplicated loci that tend to have missing copies in the reference genome. |
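The two threshold-based filters can be expressed directly from the stated cutoffs. The sketch below is a re-implementation for illustration only, not Long Ranger's code. It assumes the allele fraction is supplied as a number between 0 and 1, and how that fraction is computed is left to the caller. The remaining filters depend on phasing and alignment details that are not specified here, so they are not modeled.

```python
def basic_10x_filters(qual, is_het, allele_fraction=None):
    """Return the names of the threshold-based filters a call would fail.

    qual            -- the VCF QUAL value
    is_het          -- True for a heterozygous call, False for homozygous
    allele_fraction -- fraction in [0, 1]; None skips the fraction check
    """
    failed = []
    # Heterozygous calls need QUAL >= 15, homozygous calls need QUAL >= 50.
    if qual < (15 if is_het else 50):
        failed.append("10X_QUAL_FILTER")
    # The allele-fraction filter applies to heterozygous calls only.
    if is_het and allele_fraction is not None and allele_fraction < 0.15:
        failed.append("10X_ALLELE_FRACTION_FILTER")
    return failed


print(basic_10x_filters(20, True, 0.40))   # passes both checks
print(basic_10x_filters(20, False))        # homozygous call below QUAL 50
```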