Cell Ranger 3.1, printed on 01/21/2022
The goals of V(D)J contig annotation are to define alignments of V, D and J segments to a contig, identify CDR3 sequences, and from these data determine if a contig is productive, meaning that it is likely to correspond to a functional T or B cell receptor.
For a given dataset, the pipeline first determines if the data are TCR or BCR, then accordingly aligns all contigs to the TCR or BCR reference sequences. In rare (mixed) cases contigs are aligned to both. Alignment is seeded on 12-mer perfect matches, followed by heuristic extension; we also search backward from C segment alignments for J segment alignments that do not have 12-mer perfect matches, as these will arise occasionally from somatic hypermutation.
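The seeding step can be illustrated with a small sketch. It only shows how exact 12-mer matches between a contig and one reference segment might be located; the heuristic extension and the backward search from C segments are not covered. The function names `index_kmers` and `find_seeds` are hypothetical and are not taken from Cell Ranger.

```python
from collections import defaultdict

K = 12  # seed length stated in the text


def index_kmers(reference, k=K):
    """Map each k-mer of a reference segment to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seeds(contig, index, k=K):
    """Return (contig_pos, ref_pos) pairs where a k-mer matches exactly."""
    seeds = []
    for pos in range(len(contig) - k + 1):
        # .get avoids inserting new keys into the defaultdict
        for ref_pos in index.get(contig[pos:pos + k], ()):
            seeds.append((pos, ref_pos))
    return seeds
```

Each seed would then serve as a starting point for extension; a single mismatch inside every 12-mer window is enough to leave a segment without seeds, which is why a separate search for J segments is described above.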
It is important to understand that the choice of V(D)J reference sequences in an alignment can be arbitrary, depending on how similar the reference sequences are to each other. For D segments, which are both short and more mutated, it is often not possible to find a confident alignment, and an alignment may not be shown.
A contig is termed productive if the following conditions are met:
Full length requirement. The contig matches the initial part of a V gene. The contig continues on, ultimately matching the terminal part of a J gene.
Start requirement. The initial part of the V matches a start codon on the contig. Note that in the human and mouse reference sequences supplied by 10x, every V segment begins with a start codon.
Nonstop requirement. There is no stop codon between the V start and the J stop.
In-frame requirement. The J stop minus the V start equals one mod three. This just says that the codons on the V and J segments are in frame.
CDR3 requirement. There is an annotated CDR3 sequence (see below).
Structure requirement. Let VJ denote the sum of the lengths of the V and J segments. Let len denote the J stop minus the V start, measured on the contig. Then VJ - len must lie between -25 and +25, except for IGH, where it must lie between -55 and +25. This condition is imposed to preclude anomalous structure changes that are unlikely to correspond to functional proteins.
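The six conditions above can be sketched as a single check. This is an illustration under stated assumptions, not the pipeline's code: the `ContigAnnotation` record and its field names are hypothetical, positions are assumed to be 0-based with an exclusive J stop, "full length" is modeled as an alignment that starts at the first base of the V reference and ends at the last base of the J reference, and CDR3 detection is reduced to a boolean.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class ContigAnnotation:
    contig: str        # contig nucleotide sequence
    chain: str         # chain name, e.g. "TRA", "TRB", "IGH"
    v_start: int       # contig position where the V alignment starts (0-based)
    j_stop: int        # contig position where the J alignment ends (exclusive)
    v_ref_start: int   # position on the V reference where the alignment starts
    j_ref_stop: int    # position on the J reference where the alignment ends
    v_len: int         # length of the V reference segment
    j_len: int         # length of the J reference segment
    has_cdr3: bool     # whether a CDR3 was annotated on the contig


def productive_checks(a):
    """Evaluate each requirement separately and return them by name."""
    span = a.j_stop - a.v_start  # "len" in the text
    # Complete codons read in frame from the V start up to the J stop.
    codons = [a.contig[i:i + 3] for i in range(a.v_start, a.j_stop - 2, 3)]
    low = -55 if a.chain == "IGH" else -25
    return {
        "full_length": a.v_ref_start == 0 and a.j_ref_stop == a.j_len,
        "start": a.contig[a.v_start:a.v_start + 3] == "ATG",
        "nonstop": not any(c in STOP_CODONS for c in codons),
        "in_frame": span % 3 == 1,
        "cdr3": a.has_cdr3,
        "structure": low <= (a.v_len + a.j_len) - span <= 25,
    }


def is_productive(a):
    """A contig is productive only if every requirement holds."""
    return all(productive_checks(a).values())
```

Returning the individual checks by name makes it easy to see which requirement a non-productive contig fails.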
For each contig we search for a CDR3 sequence, using the fact that the flanking sequences of CDR3s are conserved. We compare to motifs derived from V and J reference segments for human and mouse, exactly as shown below. Here a letter represents a specific amino acid and a dot represents any amino acid.
left flank    CDR3    right flank
LQPEDSAVYY    C...    LTFG.GTRVTV
VEASQTGTYF            LIWG.GSKLSI
ATSGQASLYL
We require that a CDR3 sequence have length between 5 and 27 amino acids, start with a C, and not contain a stop codon. The flanking sequences for a candidate CDR3 are matched against the above motifs, and scored +1 for each position that matches one of the entries in a column.
For example, a left flank beginning with LTY scores 2 for the first three amino acids. (The L matches an entry in the first column, so contributes 1 to the score. The T matches an entry in the second column, so contributes 1 to the score. The Y does not match the third column, so does not contribute to the score.)
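The candidate filter and the column-wise flank scoring can be sketched as follows. The motif strings are copied from the table above; the function names are hypothetical. Two points are assumptions made for this sketch: a stop codon is represented as "*" in the translated sequence, and a dot in a motif column is treated as matching (and scoring for) any amino acid.

```python
LEFT_FLANK_MOTIFS = ["LQPEDSAVYY", "VEASQTGTYF", "ATSGQASLYL"]
RIGHT_FLANK_MOTIFS = ["LTFG.GTRVTV", "LIWG.GSKLSI"]


def is_candidate_cdr3(aa):
    """Length 5 to 27 amino acids, starts with C, no stop ('*' is assumed)."""
    return 5 <= len(aa) <= 27 and aa.startswith("C") and "*" not in aa


def flank_score(flank, motifs):
    """Add 1 for each position whose residue appears in that motif column."""
    score = 0
    for i, residue in enumerate(flank):
        column = {m[i] for m in motifs if i < len(m)}
        # Assumption: a '.' in the column matches any amino acid.
        if residue in column or "." in column:
            score += 1
    return score


# The LTY example from the text: L and T match, Y does not.
print(flank_score("LTY", LEFT_FLANK_MOTIFS))  # 2
```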
Next we find the implied stop position of the end of the V segment on the contig. This is the start position of the V segment on the contig plus the length of the V segment. We then require that the CDR3 sequence start no more than 10 bases before and no more than 20 bases after this V stop. (The condition of this paragraph is not applied in the denovo case.)
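As a minimal sketch of this positional condition, assuming all positions are measured in bases on the contig (the function and parameter names are hypothetical):

```python
def cdr3_start_allowed(cdr3_start, v_start, v_len, denovo=False):
    """Check that the CDR3 starts within [-10, +20] bases of the implied V stop."""
    if denovo:
        # The text states that this condition is not applied in the denovo case.
        return True
    v_stop = v_start + v_len  # implied stop of the V segment on the contig
    return -10 <= cdr3_start - v_stop <= 20
```

For example, with a V segment starting at position 50 and a V length of 300, the implied V stop is 350, so a CDR3 may start anywhere from position 340 to position 370.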
If there is more than one CDR3 sequence, we choose the one having the highest score. If there is a tie, we choose the one having the later start position on the contig. If there is still a tie, we choose the longer CDR3.
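The three-level tie-breaking rule maps naturally onto a tuple comparison. The sketch below assumes each candidate carries its amino acid sequence, its start position on the contig, and its flank score; the `Cdr3Candidate` type is hypothetical.

```python
from typing import NamedTuple


class Cdr3Candidate(NamedTuple):
    aa: str     # CDR3 amino acid sequence
    start: int  # start position on the contig
    score: int  # flank motif score


def choose_cdr3(candidates):
    """Prefer the highest score, then the later start, then the longer CDR3."""
    return max(candidates, key=lambda c: (c.score, c.start, len(c.aa)))
```

Python compares tuples element by element, so the start position is consulted only when scores tie, and the length only when both score and start tie.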
Cell barcodes are grouped together into clonotypes if they share the same set of productive CDR3 nucleotide sequences by exact match. Note that for B cells, somatic mutations that fall within the CDR3 will break up clonotypes that are in fact clonally related. Cells with somatic mutations outside the CDR3 will be considered to share a clonotype.
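Grouping by exact match on the set of productive CDR3 nucleotide sequences can be sketched with a dictionary keyed by that set. The input layout (a mapping from barcode to its productive CDR3 sequences) and the function name are hypothetical, and skipping barcodes without any productive CDR3 is an assumption of this sketch rather than something the text specifies.

```python
from collections import defaultdict


def group_clonotypes(cell_cdr3s):
    """Group barcodes whose sets of productive CDR3 nt sequences are identical."""
    groups = defaultdict(list)
    for barcode, cdr3s in cell_cdr3s.items():
        key = frozenset(cdr3s)  # order-independent, exact-match key
        if not key:
            continue  # assumption: cells with no productive CDR3 are left out
        groups[key].append(barcode)
    return dict(groups)
```

Because the key is the exact nucleotide sequence, a single somatic mutation inside a CDR3 yields a different key, which is how clonally related B cells can end up in separate clonotypes.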
For each clonotype and each CDR3, the contigs in all cells are assembled together to produce a clonotype consensus sequence.
Because this sequence is constructed using multiple cells, its accuracy is expected to be even higher than sequences constructed from a single cell.