Cell Ranger2.0, printed on 11/23/2024
The assembly process operates independently on each cell barcode. The output for each cell barcode is a set of assembled contigs that represent the best estimate of transcript sequences present, along with per-base quality value estimates, and the number of UMIs and reads supporting each contig. The assembly algorithm proceeds through the following steps:
Trim known adapter and primer sequences from the 5’ and 3’ ends of reads using the cutadapt tool. This tool uses Smith-Waterman alignment and allows for a small number of differences from the expected primer sequences.
The FILTER_VDJ_READS stage aligns reads to all the V(D)J gene segments included in the reference. Read pairs that contain at least one exact 20 bp match to a reference segment are included in the set of reads to be assembled. These mappings are not full alignments and are used only to filter reads prior to assembly.
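The exact-match filter described above can be illustrated with a minimal Python sketch. This is not the FILTER_VDJ_READS implementation; the function names (build_kmer_index, read_has_match, keep_read_pair) are hypothetical, and the sketch assumes forward-strand matching only, which the text does not specify.

```python
from typing import Iterable, Set

K = 20  # exact-match length stated in the text


def build_kmer_index(segments: Iterable[str], k: int = K) -> Set[str]:
    """Collect every k-mer found in the reference segments."""
    index: Set[str] = set()
    for seg in segments:
        seg = seg.upper()
        for i in range(len(seg) - k + 1):
            index.add(seg[i:i + k])
    return index


def read_has_match(read: str, index: Set[str], k: int = K) -> bool:
    """True if any k-mer of the read occurs exactly in the index."""
    read = read.upper()
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


def keep_read_pair(read1: str, read2: str, index: Set[str], k: int = K) -> bool:
    """Keep the pair if either mate has at least one exact k-mer match.

    Assumption: forward strand only; a real filter may also need to
    consider reverse complements.
    """
    return read_has_match(read1, index, k) or read_has_match(read2, index, k)
```

A set lookup per k-mer is enough here because, as the text notes, these matches only screen reads and are not full alignments.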
The FILTER_UMIS stage analyzes the distribution of reads per UMI to determine a threshold that separates 'real' UMIs, which likely result from correct cDNA production and enrichment, from 'bad' UMIs, which may result from error processes in the assay.
The ASSEMBLE_VDJ stage performs de novo assembly of reads from each cell barcode independently. The assembler downsamples input reads to an average of 400 reads per UMI to avoid artifacts caused by extremely high coverage. The assembler uses only reads from UMIs that pass the FILTER_UMIS reads-per-UMI threshold and that are detected as V(D)J reads by the FILTER_VDJ_READS stage.
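One way to picture the downsampling step is the following Python sketch. It is an illustration under stated assumptions, not the assembler's actual procedure: the function name downsample_reads, the seed parameter, and the choice of uniform random sampling over the pooled reads of a barcode are all hypothetical. Only the target of 400 reads per UMI on average comes from the text.

```python
import random
from typing import Dict, List


def downsample_reads(
    reads_by_umi: Dict[str, List[str]],
    target_mean: int = 400,   # average reads per UMI, from the text
    seed: int = 0,            # hypothetical: fixed seed for repeatability
) -> Dict[str, List[str]]:
    """Subsample a barcode's reads so the mean reads per UMI is at most target_mean."""
    n_umis = len(reads_by_umi)
    total = sum(len(reads) for reads in reads_by_umi.values())
    budget = target_mean * n_umis

    # Already at or below the target coverage: return a copy unchanged.
    if n_umis == 0 or total <= budget:
        return {umi: list(reads) for umi, reads in reads_by_umi.items()}

    # Pool all reads, draw exactly `budget` of them uniformly at random,
    # then regroup by UMI. High-coverage UMIs lose proportionally more reads.
    pool = [(umi, read) for umi, reads in reads_by_umi.items() for read in reads]
    kept = random.Random(seed).sample(pool, budget)

    out: Dict[str, List[str]] = {umi: [] for umi in reads_by_umi}
    for umi, read in kept:
        out[umi].append(read)
    return out
```

Sampling from the pooled reads keeps the average at the target while preserving the relative coverage of each UMI; capping each UMI separately would be a different design with a different effect on coverage.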
The assembly algorithm is outlined here:
Each of the supported paths through the De Bruijn graph in Step 8 is emitted as a contig sequence from the assembler. Input read pairs that contributed to each contig are mapped to the final contig sequence and emitted in the all_contig.bam output file.
Each base in the assembled contigs is assigned a Phred-scaled Quality Value (QV), representing an estimate of the probability of an error at that base. The QV is assigned with a hierarchical model that accounts for errors in reverse transcription (RT), which affect all reads with the same UMI, and sequencing errors, which affect individual reads. The sequencing error model uses the reported sequencer QVs. At recommended sequencing depths, many reads per UMI are observed, so sequencing errors in individual reads are corrected rapidly. We estimate that the V(D)J RT reaction has an error rate of 1e-4 per base, so assembled bases that are covered by a single UMI will be assigned Q40, and bases covered by >= 2 UMIs will be assigned Q60.
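The link between the RT error rate and Q40 follows from the standard Phred definition, QV = -10 * log10(p). The short Python sketch below shows that conversion and restates the two QV values quoted above. The function names are hypothetical, and rt_limited_qv is a simplification that ignores the sequencing-error part of the hierarchical model; it only encodes the Q40 and Q60 figures given in the text.

```python
import math


def phred_qv(p_error: float) -> float:
    """Standard Phred scaling: QV = -10 * log10(error probability)."""
    return -10.0 * math.log10(p_error)


def error_prob(qv: float) -> float:
    """Inverse of phred_qv: error probability implied by a QV."""
    return 10.0 ** (-qv / 10.0)


def rt_limited_qv(n_umis: int) -> int:
    """QV quoted in the text when sequencing errors are already corrected.

    Assumption: this only restates the values from the text
    (1 UMI -> Q40, >= 2 UMIs -> Q60); it is not the full model.
    """
    if n_umis < 1:
        raise ValueError("a base must be covered by at least one UMI")
    return 40 if n_umis == 1 else 60


# An RT error rate of 1e-4 per base corresponds to Q40.
single_umi_qv = phred_qv(1e-4)
```

On this scale, Q60 corresponds to an error probability of 1e-6 per base.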