Cell Ranger 2.1, printed on 12/22/2024
The assembly process operates independently on each cell barcode. The output for each cell barcode is a set of assembled contigs that represent the best estimate of transcript sequences present, along with per-base quality value estimates, and the number of UMIs and reads supporting each contig. The assembly algorithm proceeds through the following steps:
Trim known adapter and primer sequences from the 5′ and 3′ ends of reads using the cutadapt tool. This tool uses Smith-Waterman alignment and allows for a small number of differences from the expected primer sequences.
The FILTER_VDJ_READS stage approximately aligns reads to all the V(D)J gene segments included in the reference. Read pairs that exceed a specified alignment score and include at least one 15 bp exact match against at least one of the reference segments are included in the set of reads to be assembled. These mappings are not full alignments and are used only to filter reads before assembly.
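The exact-match half of this filter can be illustrated with a small k-mer lookup. This is a minimal sketch, not the FILTER_VDJ_READS implementation: the function names are hypothetical, the alignment-score threshold is omitted entirely, and reverse-complement matches are not considered.

```python
K = 15  # exact-match length stated in the text


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference_index(segments, k=K):
    """Collect every k-mer found in any reference segment (hypothetical helper)."""
    index = set()
    for segment in segments:
        index |= kmers(segment, k)
    return index


def pair_has_exact_match(read1, read2, index, k=K):
    """True if either read of the pair shares at least one k-mer with the reference index."""
    return any(not kmers(read, k).isdisjoint(index) for read in (read1, read2))
```

A read pair would pass this part of the filter if `pair_has_exact_match(r1, r2, index)` is true for an index built once from all reference segments. The score threshold described above would be an additional, separate condition.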
The ASSEMBLE_VDJ stage performs de novo assembly of reads from each cell barcode independently. The assembler will use at most 100k reads per cell barcode to avoid artifacts caused by extremely high coverage. The assembler only uses reads from UMIs that have at least 10 reads and are detected as V(D)J reads by the FILTER_VDJ_READS stage. NOTE: Cell Ranger 2.1 includes a --denovo option that instructs the assembler to ignore the results of the filtering step and attempt to assemble all reads. This option can be useful when working with poorly annotated species.
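The read-selection rules for a single cell barcode can be sketched as follows. Several details are assumptions made for illustration only: the function and parameter names are hypothetical, reads per UMI are counted before the V(D)J filter is applied, the UMI threshold is assumed to still apply in de novo mode, and the 100k cap is applied by simple truncation in input order (the text does not say how the assembler chooses which reads to keep).

```python
from collections import Counter


def select_reads(reads, vdj_read_ids, denovo=False,
                 min_reads_per_umi=10, max_reads=100_000):
    """Pick the reads of one cell barcode that would be passed to assembly.

    reads        -- list of (read_id, umi) tuples for one barcode
    vdj_read_ids -- set of read ids that passed the V(D)J filter
    denovo       -- if True, ignore the filter results (mirrors --denovo)
    """
    # Assumption: the per-UMI count uses all reads, not only filtered ones.
    umi_counts = Counter(umi for _, umi in reads)
    kept = [
        read_id
        for read_id, umi in reads
        if umi_counts[umi] >= min_reads_per_umi
        and (denovo or read_id in vdj_read_ids)
    ]
    # Assumption: the cap is a plain truncation in input order.
    return kept[:max_reads]
```

With the defaults, a UMI seen in only three reads contributes nothing, and setting `denovo=True` keeps reads regardless of whether the filter flagged them.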
The assembly algorithm is outlined here:
The assembler outputs the contig sequences for paths that are assigned at least one UMI. The mapping of the input read pairs that contributed to each contig is written to the all_contig.bam output file. For more details, please refer to the source code (https://github.com/10XGenomics/cellranger/blob/master/lib/rust/vdj_asm/src/asm.rs).
Each base in the assembled contigs is assigned a Phred-scaled Quality Value (QV), representing an estimate of the probability of an error at that base. The QV is computed with a hierarchical model that accounts for errors in reverse transcription (RT), which affect all reads with the same UMI, and sequencing errors, which affect individual reads. The sequencing error model uses the reported sequencer QVs. At recommended sequencing depths, many reads per UMI are observed, so sequencing errors in individual reads are corrected rapidly. We estimate that the V(D)J RT reaction has an error rate of 1e-4 per base, so assembled bases that are covered by a single UMI will be assigned Q40, and bases covered by at least two UMIs will be assigned Q60.
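The relationship between these QVs and error probabilities follows the standard Phred definition, Q = -10 · log10(p). The sketch below shows only that conversion. It does not reproduce the hierarchical model itself, and the function names are hypothetical.

```python
import math


def phred_from_error_prob(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality value."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q):
    """Convert a Phred-scaled quality value back to an error probability."""
    return 10.0 ** (-q / 10.0)
```

Under this definition the estimated RT error rate of 1e-4 per base corresponds to Q40, matching the value assigned to bases covered by a single UMI, and Q60 corresponds to an error probability of 1e-6.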