Cell Ranger 2.1, printed on 11/24/2024
Each assembled contig in each cell is aligned against all of the germline segment reference sequences via Smith-Waterman.
First the contig is aligned to all V reference sequences. The best match is found and the matching bases are masked from the contig. Then the same procedure is followed one-by-one for D, J, C, and 5′ UTR reference sequences.
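The align-then-mask loop described above can be sketched as follows. This is a minimal illustration, not Cell Ranger's implementation: the scoring values (`match`, `mismatch`, `gap`), the `min_score` cutoff, the `annotate` function, and the layout of the `references` dictionary are all assumptions made for the example, and the Smith-Waterman routine is a plain dynamic-programming version that reports only the best score and the span it covers on the contig.

```python
def smith_waterman(query, ref, match=2, mismatch=-3, gap=-4):
    """Best local alignment of ref against query.

    Returns (score, q_start, q_end), where query[q_start:q_end] is the
    aligned span. Scoring values are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    # origin[i][j]: query index where the alignment ending at (i, j) began.
    origin = [[i] * cols for i in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Masked bases ("N") never count as a match.
            same = query[i - 1] == ref[j - 1] and query[i - 1] != "N"
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            s = max(0, diag, up, left)
            if s == 0:
                continue
            score[i][j] = s
            if s == diag:
                origin[i][j] = origin[i - 1][j - 1]
            elif s == up:
                origin[i][j] = origin[i - 1][j]
            else:
                origin[i][j] = origin[i][j - 1]
            if s > best[0]:
                best = (s, origin[i][j], i)
    return best


def annotate(contig, references, order=("V", "D", "J", "C", "5'UTR"), min_score=10):
    """Align each segment type in turn, masking the best hit before the next.

    references maps a segment type to {name: sequence} (hypothetical layout).
    min_score is an assumed cutoff, not a documented one.
    """
    masked = contig
    hits = {}
    for region in order:
        best_name, best = None, (0, 0, 0)
        for name, ref_seq in references.get(region, {}).items():
            hit = smith_waterman(masked, ref_seq)
            if hit[0] > best[0]:
                best_name, best = name, hit
        if best_name is not None and best[0] >= min_score:
            score, start, end = best
            hits[region] = (best_name, start, end, score)
            # Mask the matched bases so later segment types cannot reuse them.
            masked = masked[:start] + "N" * (end - start) + masked[end:]
    return hits, masked
```

Masking with `N` keeps the contig coordinates unchanged, so the spans reported for later segment types still index into the original contig.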
Next, the CDR3 region is sought in one of two ways. If the contig fully spans the L+V region, which contains the start codon, a CDR3 motif (Cys-to-FGXG/WGXG) is searched for in the reading frame defined by that start codon, starting from the C-terminal Cys residue in the aligned V region. Otherwise, a CDR3 sequence is searched for in all frames. CDR3 regions are restricted to be at least 26 and at most 80 nucleotides long.
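As a rough illustration of the motif search, the sketch below scans codon by codon for a Cys codon followed, in the same frame, by an FGXG or WGXG motif, and enforces the 26 to 80 nucleotide limits. The function name `find_cdr3` and its parameters are hypothetical. The sketch also makes two simplifying assumptions: the reported span runs from the Cys codon through the F/W codon, and the first qualifying Cys at or after `start_at` is used. Stop codons are not checked.

```python
import re

CYS_CODONS = {"TGT", "TGC"}
# F or W codon, then Gly, any codon, Gly (FGXG / WGXG at the nucleotide level).
MOTIF = re.compile(r"(?:TT[TC]|TGG)GG[ACGT][ACGT]{3}GG[ACGT]")


def find_cdr3(seq, frames=(0, 1, 2), start_at=0, min_len=26, max_len=80):
    """Return (frame, start, end, cdr3_nt) for the first hit, or None.

    Pass a single frame and the position of the V-region Cys when the
    reading frame is known; otherwise all three frames are scanned.
    """
    seq = seq.upper()
    for frame in frames:
        for start in range(frame, len(seq) - 2, 3):
            if start < start_at or seq[start:start + 3] not in CYS_CODONS:
                continue
            # Walk downstream in-frame looking for the closing motif.
            for motif_pos in range(start + 3, len(seq) - 11, 3):
                length = motif_pos + 3 - start
                if length > max_len:
                    break
                if length >= min_len and MOTIF.match(seq, motif_pos):
                    end = motif_pos + 3
                    return frame, start, end, seq[start:end]
    return None
```

Because both ends of the span are codon-aligned, every reported length is a multiple of three, so the 26-nucleotide minimum behaves as a 27-nucleotide minimum in this sketch.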
Each contig's productivity is determined by these criteria:
Each cell barcode is typically expected to contain two matching productive contigs, comprising either a TRA and a TRB, or a heavy chain (IGH) and a light chain (IGK or IGL). Additional productive contigs produced by the assembler are less likely to be legitimate. For each chain, we define the contig with the most UMIs as the 'top' contig. Further productive contigs with distinct CDR3s must have at least 2 UMIs and more than 0.2 × the UMI count of the top contig to be considered confident; otherwise, they are considered low-confidence. Additionally, extra productive contigs with the same CDR3 as an existing contig for that chain are considered low-confidence, since these are likely induced by assembly artifacts. Productive contigs not labeled as low-confidence are labeled as high-confidence. Only high-confidence contigs appear in the Loupe V(D)J browser.
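The confidence rules above can be expressed compactly. The sketch below is illustrative only: the record fields (`id`, `chain`, `cdr3`, `umis`) and the function name `label_confidence` are assumptions, and ties for the top contig are broken by input order, which the text does not specify.

```python
from collections import defaultdict


def label_confidence(contigs):
    """Label productive contigs from one cell barcode as 'high' or 'low'.

    contigs: list of dicts with hypothetical keys id, chain, cdr3, umis.
    """
    by_chain = defaultdict(list)
    for contig in contigs:
        by_chain[contig["chain"]].append(contig)

    labels = {}
    for chain_contigs in by_chain.values():
        # The contig with the most UMIs is the 'top' contig for its chain.
        ranked = sorted(chain_contigs, key=lambda c: c["umis"], reverse=True)
        top = ranked[0]
        labels[top["id"]] = "high"
        seen_cdr3s = {top["cdr3"]}
        for contig in ranked[1:]:
            duplicate = contig["cdr3"] in seen_cdr3s
            enough_umis = contig["umis"] >= 2 and contig["umis"] > 0.2 * top["umis"]
            labels[contig["id"]] = "low" if duplicate or not enough_umis else "high"
            seen_cdr3s.add(contig["cdr3"])
    return labels
```

For example, with a top TRA contig at 10 UMIs, a second TRA contig with a different CDR3 needs at least 3 UMIs to be labeled high-confidence, because 2 UMIs is not greater than 0.2 × 10.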
Cell barcodes are grouped together into clonotypes if they share the same set of productive CDR3 nucleotide sequences by exact match. Note that for B cells, somatic mutations that fall within the CDR3 will break up clonotypes that are in fact clonally related. Cells with somatic mutations outside the CDR3 will be considered to share a clonotype.
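Grouping by exact match on the set of CDR3 nucleotide sequences amounts to using that set as a dictionary key. The sketch below assumes a hypothetical input mapping from cell barcode to its productive CDR3 nucleotide sequences; the function name `group_clonotypes` is likewise invented for illustration.

```python
from collections import defaultdict


def group_clonotypes(cell_cdr3s):
    """Group barcodes whose sets of productive CDR3 sequences match exactly.

    cell_cdr3s: {barcode: iterable of CDR3 nucleotide strings}.
    Returns {frozenset of CDR3s: [barcodes]}.
    """
    groups = defaultdict(list)
    for barcode, cdr3s in cell_cdr3s.items():
        # frozenset ignores order, so only the set of sequences matters.
        groups[frozenset(cdr3s)].append(barcode)
    return dict(groups)
```

A single nucleotide difference in any CDR3 produces a different key, which is why somatic mutations inside the CDR3 split clonally related B cells into separate clonotypes.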
For each clonotype and each CDR3, the contigs in all cells are assembled together to produce a clonotype consensus sequence.
Because this sequence is constructed using multiple cells, its accuracy is expected to be even higher than sequences constructed from a single cell.
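To see why pooling cells can improve accuracy, consider a much simpler stand-in for the assembly step: a per-position majority vote over contigs that are already aligned and of equal length. This is not how the consensus is actually built (the text describes an assembly across cells), and `majority_consensus` is a hypothetical name; the sketch only shows how independent errors in individual contigs are outvoted.

```python
from collections import Counter


def majority_consensus(aligned_contigs):
    """Per-column majority vote over equal-length, pre-aligned sequences."""
    if len({len(s) for s in aligned_contigs}) != 1:
        raise ValueError("expected a non-empty list of equal-length sequences")
    # zip(*...) yields one tuple of bases per alignment column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_contigs)
    )
```

With three contigs that each carry an error at a different position, every column still has a two-to-one majority for the correct base, so the consensus recovers the error-free sequence.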