Cell Ranger4.0, printed on 05/16/2022
The three complementarity-determining regions are the portions of the amino acid sequence of a T or B cell receptor which are predicted to bind to an antigen. The nucleotide region encoding CDR3 spans the V(D)J junction, making it more diverse than that of the other CDRs. This serves as a useful way to identify unique chains. See here.
This is a known nucleotide sequence which serves as a unique identifier for a single GEM droplet. Each barcode usually contains reads from a single cell.
Collection of cells that share a set of productive CDR3 sequences by exact nucleotide match.
For a single clonotype and chain, a consensus built among all of the cells with that clonotype for that chain. This consensus is built by reassembling the corresponding contigs from all cells with the clonotype.
Contiguous sequence of bases produced by assembly.
A contig is full-length if it matches the initial part of a V gene, continues on, and ultimately matches the terminal part of a J gene.
When combining libraries made from different groups of GEMs into one analysis, we append in silico a small integer to the barcode of each read that identifies which library that read came from. This prevents barcode collisions, which otherwise create confusion in the form of virtual doublets.
The N50 of a sorted list of numbers is the midway point by weight. Example:
There are implementation differences for exactly how this is computed but they matter little when the list is long. Unlike the mean and median, the N50 discounts the contribution of many small numbers. That is why people use it!
The N-statistics, such as N50 or N99, are measures of centrality often used in genomics because they are somewhat robust to contamination by large numbers of low-value elements. In particular, the NXX is the value of the smallest element in the subset comprising the fewest, largest members such that the sum of the values of the subset is at least XX% of the total sum of the values of the data set. A larger value of an N-statistic indicates that a larger proportion of the total can be accounted for by large individual values, and for a given data set and YY greater than XX, the value for NYY will be less than or equal to the value for NXX. Thus, the N50 is essentially a weighted median.
Each first-strand cDNA synthesis from a transcript molecule incorporates a random 10 bp nucleotide sequence next to the cell barcode called the UMI. The UMI sequence in each read allows the pipeline to determine which reads came from the same transcript molecule. In other words, the cell barcode distinguishes between cells, and the UMI distinguishes between molecules (for example, RNA fragments) within a cell.