Long Ranger 2.2, printed on 11/21/2024
Analysis software for 10x Genomics linked read products is no longer supported. Raw data processing pipelines and visualization tools are available for download and can be used for analyzing legacy data from 10x Genomics kits in accordance with our end user licensing agreement without support.
Long Ranger uses a new aligner called 'Lariat'. Lariat aligns all the linked reads for a single barcode simultaneously, with the prior knowledge that the reads arise from a small number of long (10 kb to 200 kb) molecules. This approach allows reads to be mapped to repetitive regions with modest copy number, such as segmental duplications. Lariat is based on the original RFA method developed by Alex Bishara, Yuling Liu, et al. in Serafim Batzoglou’s lab at Stanford (Genome Research, 2015). Lariat generates candidate alignments by calling the BWA C API, then performs the RFA inference to select the final mapping position and MAPQ.
Long Ranger wraps standard short-read variant callers to generate SNP and small indel calls.
We recommend using Long Ranger with GATK.
A bundled version of FreeBayes can also be used.
The variant callers are invoked by Long Ranger with best-practices parameters,
along with some parameter changes to optimize results for 10x libraries.
After phasing, the variant caller is invoked separately in 'haploid mode'
on reads from each phased haplotype. This step boosts the sensitivity of the
variant calls by calling low-allele-fraction variants that are the dominant
allele on one haplotype. Variants called in this phase will be tagged with
HAPLOCALLED=1 in the INFO field.
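As an illustration of how the HAPLOCALLED=1 tag could be used downstream, the following minimal sketch selects haploid-mode calls from VCF data lines. It assumes tab-separated VCF records with INFO in the eighth column; the helper names and the example records are hypothetical and are not part of Long Ranger.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag-style keys map to True."""
    info = {}
    if info_field in ("", "."):
        return info
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def is_haplocalled(vcf_line):
    """Return True if a VCF data line carries HAPLOCALLED=1 in its INFO column."""
    columns = vcf_line.rstrip("\n").split("\t")
    return parse_info(columns[7]).get("HAPLOCALLED") == "1"


# Hypothetical records, for illustration only.
example_lines = [
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=30;HAPLOCALLED=1\tGT\t0|1",
    "chr1\t2000\t.\tC\tT\t60\tPASS\tDP=28\tGT\t1|0",
]
haplocalled = [line for line in example_lines if is_haplocalled(line)]
```

Only the first example record is retained, because it is the only one whose INFO column contains HAPLOCALLED=1.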
The POPULATE_INFO_FIELDS stage determines which barcodes are associated with each observed allele of each heterozygous SNP.
Long Ranger aligns the raw read sequence to the sequence of both alleles to determine which allele the read supports. The phasing algorithm finds a phasing configuration that optimizes a probabilistic model of the barcoding and read-generation process. The basic model is similar to the model in HASH (Bansal and Halpern, Genome Research, 2008), with improvements that account for false-positive variant calls, incorrect assignment of alleles to barcodes, and the possibility that a barcode carries two molecules on opposite haplotypes of the same locus.
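To make the idea of optimizing a probabilistic model concrete, here is a toy sketch of maximum-likelihood phasing from barcode-to-allele observations. It is not Long Ranger's model: it assumes each barcode comes from exactly one haplotype with equal prior probability, applies a single hypothetical per-observation error rate, and searches all phasings by brute force, so it omits the improvements listed above and only works for a handful of SNPs.

```python
import itertools
import math


def barcode_log_likelihood(observations, phasing, error_rate):
    """Log-likelihood of one barcode's allele observations under a phasing.

    observations: dict mapping SNP index -> observed allele (0 or 1).
    phasing: tuple where phasing[i] is the allele on haplotype 0 at SNP i.
    The barcode's haplotype is unknown, so it is marginalized out.
    """
    per_haplotype = []
    for hap in (0, 1):
        log_p = math.log(0.5)
        for snp, allele in observations.items():
            expected = phasing[snp] if hap == 0 else 1 - phasing[snp]
            log_p += math.log(1 - error_rate if allele == expected else error_rate)
        per_haplotype.append(log_p)
    top = max(per_haplotype)
    return top + math.log(sum(math.exp(x - top) for x in per_haplotype))


def best_phasing(barcodes, n_snps, error_rate=0.01):
    """Brute-force search for the phasing with the highest total log-likelihood."""
    best = None
    for rest in itertools.product((0, 1), repeat=n_snps - 1):
        phasing = (0,) + rest  # fix SNP 0 to remove the mirror-image solution
        score = sum(barcode_log_likelihood(obs, phasing, error_rate) for obs in barcodes)
        if best is None or score > best[1]:
            best = (phasing, score)
    return best


# Hypothetical observations: one dict per barcode.
barcodes = [{0: 0, 1: 0, 2: 1}, {0: 1, 1: 1, 2: 0}, {0: 0, 1: 0}, {1: 1, 2: 0}, {0: 0, 2: 1}]
phasing, score = best_phasing(barcodes, n_snps=3)
```

Every example barcode is consistent with haplotype 0 carrying alleles (0, 0, 1), so that phasing is selected.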
While phasing the alleles of each SNP, we also determine the haplotype of each input molecule, and tag each read in the input BAM with 'HP' and 'PS' tags indicating the haplotype and phase set that each read came from. See our BAM documentation for details. Phased reads can be very valuable when analyzing more complex variation such as SVs, CNVs and somatic variation.
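The sketch below shows one way phased reads could be grouped using these tags. It works on SAM text lines (for example, the output of a BAM-to-SAM conversion) and assumes the HP and PS tags are stored as integer optional fields; the function names and example records are hypothetical.

```python
from collections import defaultdict


def optional_tags(sam_line):
    """Return the optional fields of a SAM record as a dict of tag -> value."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        name, type_code, value = field.split(":", 2)
        tags[name] = int(value) if type_code == "i" else value
    return tags


def group_by_haplotype(sam_lines):
    """Group read names by (phase set, haplotype); unphased reads are skipped."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        tags = optional_tags(line)
        if "HP" in tags and "PS" in tags:
            read_name = line.split("\t", 1)[0]
            groups[(tags["PS"], tags["HP"])].append(read_name)
    return dict(groups)


# Hypothetical records, for illustration only.
sam_lines = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:1\tPS:i:100",
    "r2\t0\tchr1\t120\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:2\tPS:i:100",
    "r3\t0\tchr1\t140\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:1\tPS:i:100",
    "r4\t0\tchr1\t160\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
groups = group_by_haplotype(sam_lines)
```

Grouping by the (PS, HP) pair rather than HP alone matters because haplotype labels are only comparable within a single phase set.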
The large-scale SV caller looks for distant pairs of loci in the genome that share many more barcodes than would be expected by chance. This overlap indicates that the two loci that are distant in the reference sequence are nearby in the sample and generates a candidate SV. Candidate SVs are refined by comparing the layout of reads and barcodes around the event to the patterns expected in deletions, inversions, duplications, and translocations to identify the SV type and find the maximum-likelihood breakpoints.
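One simple way to quantify "more barcodes than would be expected by chance" is a hypergeometric test, sketched below. This is an assumption made for illustration and is not necessarily the statistic Long Ranger uses: it treats the barcodes at the two loci as independent draws from a pool of `n_total` barcodes, and all counts in the example are hypothetical.

```python
import math


def expected_overlap(n_a, n_b, n_total):
    """Expected number of shared barcodes if the two loci are independent."""
    return n_a * n_b / n_total


def overlap_p_value(k, n_a, n_b, n_total):
    """P(shared barcodes >= k) under a hypergeometric null model.

    n_a, n_b: barcodes observed at locus A and locus B.
    n_total: total number of barcodes in the pool.
    """
    denominator = math.comb(n_total, n_b)
    upper = min(n_a, n_b)
    tail = sum(
        math.comb(n_a, i) * math.comb(n_total - n_a, n_b - i)
        for i in range(k, upper + 1)
    )
    return tail / denominator


# Hypothetical counts: 50 and 40 barcodes at two loci, 20 shared, 1000 in total.
expected = expected_overlap(50, 40, 1000)
p_value = overlap_p_value(20, 50, 40, 1000)
```

With these counts the expected overlap is 2 barcodes, so observing 20 shared barcodes gives a very small p-value and the pair of loci would be a plausible SV candidate under this model.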
Long Ranger also implements a hidden Markov model algorithm to find large-scale
CNVs using barcode coverage data. Barcode coverage is the count of the number
of long molecules that span a given position in the genome. Because the barcode
information smooths over local read coverage fluctuations, it is a much more
stable signal for large CNVs, especially at lower read coverage. Large
deletions that extend to the end of a chromosome do not generate barcode
overlap and will only be detected by this method. The results from this method
and the barcode overlap method are combined into the large_svs.vcf.gz output
file produced by Long Ranger.
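The following toy sketch shows how a hidden Markov model can segment barcode coverage into copy-number states with the Viterbi algorithm. The details are assumptions for illustration, not Long Ranger's implementation: three copy-number states, Poisson emissions whose mean scales with copy number, a uniform start distribution, and a single hypothetical `stay_prob` transition parameter.

```python
import math


def poisson_log_pmf(k, mean):
    """Log probability of observing count k under a Poisson with the given mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def viterbi_copy_number(coverage, diploid_mean, states=(1, 2, 3), stay_prob=0.99):
    """Most likely copy-number path for a series of per-bin barcode counts."""
    n_states = len(states)
    stay = math.log(stay_prob)
    switch = math.log((1 - stay_prob) / (n_states - 1))

    def emit(s, k):
        # Expected barcode coverage scales linearly with copy number.
        return poisson_log_pmf(k, diploid_mean * states[s] / 2)

    scores = [math.log(1 / n_states) + emit(s, coverage[0]) for s in range(n_states)]
    back = []
    for k in coverage[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda p: scores[p] + (stay if p == s else switch),
            )
            transition = stay if best_prev == s else switch
            new_scores.append(scores[best_prev] + transition + emit(s, k))
            pointers.append(best_prev)
        scores = new_scores
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


# Hypothetical per-bin barcode counts with a one-copy loss in the middle.
coverage = [20, 19, 21, 20, 10, 11, 9, 10, 10, 9, 20, 21, 19]
calls = viterbi_copy_number(coverage, diploid_mean=20)
```

The high `stay_prob` penalizes state changes, so isolated noisy bins do not produce calls, while a sustained run of roughly halved coverage is assigned copy number 1.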
In whole-genome mode, Long Ranger calls deletion SVs in the 50 bp to 30 kbp size range. Long Ranger uses haplotype-specific coverage drops and discordant read pairs to identify potential deletions. Either a local assembly of phased reads or a probabilistic model of phased coverage and discordant reads is used to confirm the event and determine the breakpoints.
In targeted mode, Long Ranger calls heterozygous and homozygous deletion SVs ranging in size from one exon up to 50 kb. By looking for haplotype-specific drops in coverage, Long Ranger can detect deletions without seeing any discordant read pairs. Sufficient coverage of phased reads is required to detect heterozygous deletions, which requires covered heterozygous variants in the vicinity of the event.
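A haplotype-specific coverage drop can be illustrated with the minimal sketch below. It assumes per-bin phased read coverage is already available for each haplotype; the thresholds `drop_fraction` and `keep_fraction`, the function names, and the example values are hypothetical, and the confirmation step described above is not modeled.

```python
from statistics import median


def haplotype_drop_bins(cov_hap1, cov_hap2, drop_fraction=0.2, keep_fraction=0.5):
    """Flag bins where one haplotype loses coverage while the other keeps it.

    Returns a list of (bin_index, haplotype) tuples, haplotype being 1 or 2.
    """
    med1, med2 = median(cov_hap1), median(cov_hap2)
    flagged = []
    for i, (c1, c2) in enumerate(zip(cov_hap1, cov_hap2)):
        if c1 <= drop_fraction * med1 and c2 >= keep_fraction * med2:
            flagged.append((i, 1))
        elif c2 <= drop_fraction * med2 and c1 >= keep_fraction * med1:
            flagged.append((i, 2))
    return flagged


def merge_runs(flagged):
    """Merge consecutive flagged bins on the same haplotype into (first, last, hap)."""
    runs = []
    for index, hap in flagged:
        if runs and runs[-1][2] == hap and runs[-1][1] == index - 1:
            runs[-1] = (runs[-1][0], index, hap)
        else:
            runs.append((index, index, hap))
    return runs


# Hypothetical per-bin phased coverage with a heterozygous loss on haplotype 1.
cov_hap1 = [15, 14, 16, 1, 0, 1, 15, 16]
cov_hap2 = [15, 16, 14, 15, 14, 16, 15, 14]
candidates = merge_runs(haplotype_drop_bins(cov_hap1, cov_hap2))
```

Here bins 3 through 5 form a single candidate heterozygous deletion on haplotype 1; a drop on both haplotypes at once would not be flagged by this check and would need separate handling as a homozygous event.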
SV calls whose breakpoints overlap different copies of the same segmental duplication are filtered. Structural variation is enriched in such regions, so some of these calls may represent true events. However, a large fraction of calls in regions of structural variation are due to the aligners being unable to properly resolve repetitive regions, because small variations are often sufficient for unique and high-quality mapping to one or the other copy of the segmental duplication.
The segmental duplications filter included with Long Ranger uses data derived from the Segmental Duplication track in the UCSC browser. To create a filter for a custom reference, consult Creating Custom SV Blacklist Files.
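The check described above can be sketched as follows, assuming the segmental duplication pairs are stored in standard BEDPE layout (chrom1, start1, end1, chrom2, start2, end2, with 0-based half-open intervals). The function names and coordinates are hypothetical, and this is an illustration of the overlap test rather than Long Ranger's own code.

```python
def parse_bedpe(lines):
    """Parse BEDPE lines into a list of (copy_a, copy_b) interval pairs."""
    pairs = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        pairs.append(((f[0], int(f[1]), int(f[2])), (f[3], int(f[4]), int(f[5]))))
    return pairs


def contains(interval, breakpoint):
    """True if a (chrom, pos) breakpoint falls inside a half-open interval."""
    chrom, start, end = interval
    bp_chrom, bp_pos = breakpoint
    return chrom == bp_chrom and start <= bp_pos < end


def spans_segdup_copies(bp1, bp2, segdup_pairs):
    """True if the two breakpoints land in different copies of one duplication."""
    for copy_a, copy_b in segdup_pairs:
        if (contains(copy_a, bp1) and contains(copy_b, bp2)) or (
            contains(copy_a, bp2) and contains(copy_b, bp1)
        ):
            return True
    return False


# Hypothetical duplication pair and SV breakpoints.
segdups = parse_bedpe(["chr1\t10000\t20000\tchr1\t500000\t510000"])
filtered = spans_segdup_copies(("chr1", 15000), ("chr1", 505000), segdups)
kept = spans_segdup_copies(("chr1", 15000), ("chr1", 900000), segdups)
```

A call is flagged only when both breakpoints fall in the two copies of the same pair; a call with a single breakpoint inside a duplication is left alone by this check.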
SV calls that are within 10 kb of gaps or new sequences introduced in GRCh38 are also filtered because such calls likely represent misassemblies in hg19.
The gaps portion of the SV blacklist included with Long Ranger is derived from the gaps track in the UCSC browser, and GRCh38-based filtering is derived from the hg19 diff track in the UCSC browser. To create a blacklist for a custom reference, consult Creating Custom SV Blacklist Files.
Long Ranger comes with a pre-built SV blacklist and high-identity segmental duplication tracks for hg19 and GRCh38. You can find these files inside the Long Ranger tarball at:
longranger-cs/<version>/tenkit/lib/python/tenkit/sv_data/<genome build>/default_sv_blacklist.bed
longranger-cs/<version>/tenkit/lib/python/tenkit/sv_data/<genome build>/default_segdups.bedpe
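The 10 kb proximity filter for gaps can be sketched against a BED-style blacklist as shown below. It assumes the blacklist uses the standard three leading BED columns (chrom, start, end); the function names, the `margin` parameter, and the example interval are hypothetical and do not come from the bundled file.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping header lines."""
    intervals = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        intervals.append((f[0], int(f[1]), int(f[2])))
    return intervals


def near_blacklist(breakpoint, intervals, margin=10_000):
    """True if a (chrom, pos) breakpoint lies within `margin` bases of an interval."""
    chrom, pos = breakpoint
    return any(
        c == chrom and start - margin <= pos < end + margin
        for c, start, end in intervals
    )


# Hypothetical blacklist entry and breakpoints.
blacklist = parse_bed(["chr2\t1000000\t1050000\tgap"])
close_call = near_blacklist(("chr2", 1055000), blacklist)
far_call = near_blacklist(("chr2", 1070000), blacklist)
```

The linear scan is adequate for a short blacklist; for large files an interval index keyed by chromosome would avoid rescanning every entry per breakpoint.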