10x Genomics
Chromium Single Cell Immune Profiling

Annotation Algorithm

The goals of V(D)J contig annotation are to define alignments of V, D and J segments to a contig, identify CDR3 sequences, and from these data determine if a contig is productive, meaning that it is likely to correspond to a functional T or B cell receptor.

Alignment to the V(D)J Reference

For a given dataset, the pipeline first determines whether the data are TCR or BCR, then aligns all contigs to the TCR or BCR reference sequences accordingly. In rare cases where the data are mixed, contigs are aligned to both references. Alignment is seeded on 12-mer perfect matches, followed by heuristic extension. We also search backward from C segment alignments for J segment alignments that do not have 12-mer perfect matches, as these will arise occasionally from somatic hypermutation.
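The seeding step can be illustrated with a minimal sketch. It covers only the search for 12-mer perfect matches; the heuristic extension and the backward search for J segments are not specified here and are not attempted. The function names and the dictionary-based index are assumptions made for illustration, not the pipeline's implementation.

```python
from collections import defaultdict

K = 12  # seed length stated in the text


def build_kmer_index(references, k=K):
    """Map each k-mer to the (reference name, offset) pairs where it occurs.

    `references` is a hypothetical dict of segment name -> nucleotide sequence.
    """
    index = defaultdict(list)
    for name, seq in references.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(contig, index, k=K):
    """Return (contig offset, reference name, reference offset) for every
    perfect k-mer match between the contig and the indexed references."""
    seeds = []
    for pos in range(len(contig) - k + 1):
        for name, ref_pos in index.get(contig[pos:pos + k], []):
            seeds.append((pos, name, ref_pos))
    return seeds


# Toy data: an 18-base "reference segment" and a contig containing 14 of its bases.
references = {"V1": "GATTACAGGCTTAACCGT"}
index = build_kmer_index(references)
seeds = find_seeds("TTT" + "TTACAGGCTTAACC" + "CCC", index)
```

Each seed records where a 12-mer of the contig coincides exactly with a 12-mer of a reference segment. A contig whose J region carries enough mutations to leave no intact 12-mer produces no seed at all, which is the situation the backward search from C segment alignments is meant to handle.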

It is important to understand that the choice of V(D)J reference sequences in an alignment can be arbitrary, depending on how similar the reference sequences are to each other. For D segments, which are both short and more mutated, it is often not possible to find a confident alignment, and an alignment may not be shown.

Productive Contigs

A contig is termed productive if the following conditions are met:

CDR3

For each contig we search for a CDR3 sequence, using the fact that the flanking sequences of CDR3s are conserved. We compare to motifs derived from V and J reference segments for human and mouse, exactly as shown below. Here a letter represents a specific amino acid and a dot represents any amino acid.

left flank   CDR3   right flank
LQPEDSAVYY   C...   LTFG.GTRVTV
VEASQTGTYF          LIWG.GSKLSI
ATSGQASLYL

We require that a CDR3 sequence have length between 5 and 27 amino acids, start with a C, and not contain a stop codon. The flanking sequences for a candidate CDR3 are matched against the above motifs, and scored +1 for each position that matches one of the entries in a column.

For example,

LTY.... 
scores 2 for the first three amino acids. (The L matches an entry in the first column, so contributes 1 to the score. The T matches an entry in the second column, so contributes 1 to the score. The Y does not match the third column, so does not contribute to the score.)


For a candidate CDR3 to be declared a CDR3 sequence, it must score at least 10. In addition the left flank must contribute at least 3 and the right flank must contribute at least 4.
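The scoring rule above can be sketched as follows, using only the motifs listed in the table. The function names are hypothetical. Three points are assumptions: that a dot in a motif column matches any amino acid and so contributes to the score, that the length bounds of 5 and 27 are inclusive, and that a stop codon appears as `*` in the translated sequence.

```python
# Motifs copied from the table above; a dot stands for any amino acid.
LEFT_MOTIFS = ["LQPEDSAVYY", "VEASQTGTYF", "ATSGQASLYL"]
RIGHT_MOTIFS = ["LTFG.GTRVTV", "LIWG.GSKLSI"]


def flank_score(flank, motifs):
    """Score +1 for each position of `flank` that matches an entry in the
    corresponding motif column (assumption: a '.' entry matches anything)."""
    score = 0
    for i, aa in enumerate(flank):
        column = {m[i] for m in motifs if i < len(m)}
        if "." in column or aa in column:
            score += 1
    return score


def is_cdr3(left_flank, cdr3, right_flank):
    """Apply the length, start, stop codon, and score conditions to a candidate."""
    if not (5 <= len(cdr3) <= 27):  # assumption: bounds are inclusive
        return False
    if not cdr3.startswith("C") or "*" in cdr3:  # assumption: '*' marks a stop
        return False
    left = flank_score(left_flank, LEFT_MOTIFS)
    right = flank_score(right_flank, RIGHT_MOTIFS)
    return left + right >= 10 and left >= 3 and right >= 4


# The worked example from the text: L and T match, Y does not.
example = flank_score("LTY", LEFT_MOTIFS)
```

Note that a candidate whose right flank matches perfectly can still be rejected if its left flank contributes fewer than 3, since the per-flank minimums are checked in addition to the total.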

Next we find the implied stop position of the V segment on the contig: the start position of the V segment on the contig plus the length of the V segment. We then require that the CDR3 sequence start at most 10 bases before this stop position and at most 20 bases after it. (This condition is not applied in the denovo case.)
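As a small illustration of this positional condition, the sketch below computes the implied V stop and tests whether a CDR3 start falls in the allowed window. The function and parameter names are hypothetical, all positions are assumed to be base offsets on the contig, and both bounds are assumed to be inclusive.

```python
def implied_v_stop(v_start_on_contig, v_length):
    """Implied stop of the V segment: its start on the contig plus its length."""
    return v_start_on_contig + v_length


def cdr3_start_near_v_stop(cdr3_start, v_start_on_contig, v_length,
                           max_before=10, max_after=20):
    """True if the CDR3 starts at most `max_before` bases before and at most
    `max_after` bases after the implied V stop (assumption: inclusive bounds)."""
    offset = cdr3_start - implied_v_stop(v_start_on_contig, v_length)
    return -max_before <= offset <= max_after


# Toy numbers: a V segment starting at base 50 with length 300 stops at 350,
# so CDR3 starts from 340 through 370 are accepted.
accepted = cdr3_start_near_v_stop(345, 50, 300)
```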

If there is more than one CDR3 sequence, we choose the one having the highest score. If there is a tie, we choose the one having the later start position on the contig. If there is still a tie, we choose the longer CDR3.
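The three-level tie-break can be expressed as a single ordered comparison. In this sketch each candidate is a dictionary with hypothetical keys `score`, `start`, and `seq`; the representation is an assumption for illustration only.

```python
def choose_cdr3(candidates):
    """Pick the candidate with the highest score, then the later start
    position on the contig, then the longer CDR3 sequence."""
    return max(candidates, key=lambda c: (c["score"], c["start"], len(c["seq"])))


candidates = [
    {"score": 12, "start": 300, "seq": "CASSLGTF"},
    {"score": 12, "start": 310, "seq": "CASSF"},
    {"score": 11, "start": 400, "seq": "CASSLGQGTEAFF"},
]
best = choose_cdr3(candidates)
```

Python compares the key tuples element by element, so the start position is consulted only when scores are equal, and the length only when both score and start position are equal.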

Clonotype Grouping and Consensus Building

Cell barcodes are grouped together into clonotypes if they share the same set of productive CDR3 nucleotide sequences by exact match. Note that for B cells, somatic mutations that fall within the CDR3 will split cells that are in fact clonally related into separate clonotypes. Cells that differ only by somatic mutations outside the CDR3 will still be considered to share a clonotype.
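A minimal sketch of this grouping rule follows. The input is assumed to be a mapping from cell barcode to the list of productive CDR3 nucleotide sequences found in that cell; the function name, the input shape, and the choice to skip cells with no productive CDR3 are all assumptions for illustration.

```python
from collections import defaultdict


def group_clonotypes(cell_cdr3s):
    """Group barcodes whose sets of productive CDR3 nucleotide sequences
    are identical by exact match."""
    groups = defaultdict(list)
    for barcode, cdr3_nts in cell_cdr3s.items():
        key = frozenset(cdr3_nts)  # order within a cell does not matter
        if key:  # assumption: cells with no productive CDR3 are left ungrouped
            groups[key].append(barcode)
    return dict(groups)


# Toy data: cell3 differs from cell1 and cell2 by one base in one CDR3.
cells = {
    "cell1": ["TGTGCCAGCAGTTTC", "TGTGCTGTGAGAGAC"],
    "cell2": ["TGTGCTGTGAGAGAC", "TGTGCCAGCAGTTTC"],
    "cell3": ["TGTGCCAGCAGTTTT", "TGTGCTGTGAGAGAC"],
    "cell4": [],
}
clonotypes = group_clonotypes(cells)
```

Because the key is built from exact nucleotide sequences, a single-base difference inside a CDR3 places `cell3` in its own clonotype, which mirrors the effect of somatic mutation within the CDR3.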

For each clonotype and each CDR3, the contigs in all cells are assembled together to produce a clonotype consensus sequence.

Because this sequence is constructed using multiple cells, its accuracy is expected to be even higher than that of sequences constructed from a single cell.