Cell Ranger 3.1, printed on 12/05/2024
The assembly process takes as input the reads for a single barcode, and 'glues' these reads together, yielding as output a set of assembled contigs that represent the best estimate of transcript sequences present. In addition, each base in each contig is assigned a quality value. We also track the numbers of UMIs and reads supporting each contig.
The problem of making contigs is complicated because there are many forms of 'noise' in the data. These arise from background (extracellular) mRNA, cell doublets, errors in transcription in the cell, errors in reverse transcription to make cDNA, random errors during sequencing, index hopping in the sequencing process, and in other ways.
The assembly process uses the reference sequence in places, as noted below, unless the pipeline is run in denovo mode. See also Annotation Algorithm, parts of which are used in the assembly algorithm.
The assembly algorithm proceeds through the following steps:
| Step | Operation |
|---|---|
| Read subsampling | Reduce the reads for a given barcode to at most 80,000, because more reads don't help. |
| Read trimming | Trim off read bases after enrichment primers. |
| Graph formation | Build a De Bruijn graph using k = 20. |
| Reference-free graph simplification | Simplify the graph by removing 'noise' edges. |
| Reference-assisted graph simplification | Same, but this time use the reference. |
| UMI filtering | Filter out UMIs that are likely to be artifacts. |
| Contig construction | Make contigs by looking for the best path through the graph for each UMI. |
| Competitive deletion of contigs | Remove contigs that are much weaker than other contigs and which are likely to be artifacts. |
| Contig confidence | Define the high-confidence contigs, which are likely to represent bona fide transcripts from a single cell associated to a barcode. |
| Contig quality scores | Assign a quality score to each base on each contig. |
And in more detail:
Very high coverage of a given cell can occur because of overall high sequencing coverage, or more commonly, for BCR, because of high expression in plasma cells. Such coverage adds little information but degrades computational performance, so we cap the number of reads per barcode: at most 80,000 reads are used, and if there are more than that, we subsample.
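The cap can be implemented as a uniform subsample. This is a minimal sketch; the pipeline's actual sampling scheme may differ, and the fixed seed is purely illustrative:

```python
import random

MAX_READS_PER_BARCODE = 80_000  # cap stated above

def subsample_reads(reads, cap=MAX_READS_PER_BARCODE, seed=0):
    """Return at most `cap` reads, sampled uniformly without replacement."""
    if len(reads) <= cap:
        return list(reads)
    rng = random.Random(seed)  # fixed seed so results are reproducible
    return rng.sample(reads, cap)
```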
The inner enrichment primers hybridize to constant regions of V(D)J genes. Any bases to the right of those positions should not be present in the data, so we remove them from the reads.
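For example, trimming can be sketched as a search for the primer sequence followed by truncation. The primer sequences here are hypothetical, and whether the primer bases themselves are retained is a detail the text does not specify; this sketch keeps them:

```python
def trim_after_primer(read: str, primers: list[str]) -> str:
    """Trim a read at the first enrichment-primer hit, dropping all
    bases to the right of the primer (the primer itself is kept)."""
    for primer in primers:
        idx = read.find(primer)
        if idx != -1:
            return read[:idx + len(primer)]
    return read
```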
We create a De Bruijn graph using k = 20 and transform it into a directed graph whose edges are DNA sequences, corresponding to unbranched paths in the De Bruijn graph.
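Conceptually, the De Bruijn graph links each k-mer's (k-1)-base prefix to its (k-1)-base suffix. A toy construction follows, with the pipeline's k = 20 as the default but parameterized so small examples work:

```python
from collections import defaultdict

K = 20  # k-mer size used by the assembler

def debruijn_edges(reads, k=K):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer occurring in a read contributes one directed edge;
    unbranched paths in this graph become the sequence edges of the
    simplified directed graph described above.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```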
A collection of heuristic steps are applied to simplify the graph. During this process we track and edit the read support on each edge. We describe here several examples of simplification steps.
Branch cleaning.
(I) For each branch in the graph, and for each UMI, if one branch has ten times more reads from the UMI supporting it than the other branch has from the UMI, we remove read support for the UMI from the other branch.
(II) Given two branches emanating from a vertex, we delete the weak branch if all of the following hold: (a) there are at least twice as many reads on the strong branch; (b) the weak branch does not have more than 8 reads for any UMI; (c) for every UMI, the strong branch has at least as many reads, with at most one possible exception, where it can have just one read.
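Rule (II) can be sketched by representing each branch as a map from UMI to read count (a simplification of the real graph bookkeeping; condition (c)'s exception is interpreted here as a UMI with a single read on the weak branch):

```python
def delete_weak_branch(strong, weak):
    """Branch-cleaning rule (II): return True if the weak branch
    should be deleted. `strong` and `weak` map UMI -> read count."""
    # (a) at least twice as many reads on the strong branch overall
    if sum(strong.values()) < 2 * sum(weak.values()):
        return False
    # (b) the weak branch has at most 8 reads for every UMI
    if any(n > 8 for n in weak.values()):
        return False
    # (c) per UMI, the strong branch has at least as many reads,
    #     with at most one exception, which may have just one read
    exceptions = [n for umi, n in weak.items() if strong.get(umi, 0) < n]
    return len(exceptions) <= 1 and all(n == 1 for n in exceptions)
```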
Path cleaning. For each UMI, define a single strongest path for the UMI. Then delete any graph edges that are not on such a strong path.
Component cleaning. For each UMI, if one graph component has ten times more reads supporting it from that UMI than another component, delete the read support for that UMI from the other components.
If the pipeline is not run in denovo mode, we pop bubbles in the graph with the aid of the reference sequence. There are several heuristic tests, all of which require that both bubble branches have the same length. For example, if in addition, one branch is supported by at least three UMIs, and the other branch is supported by only one UMI, and the first branch has a kmer matching the reference, we delete the weak branch.
Here we define a set of filtered UMIs.
As above, we define a single strongest path for each UMI. We consider only strong paths that either contain a reference kmer, or in the denovo case, match a primer.
Find good graph edges: edges that appear on one or more strong paths.
We find all the reads assigned to one or more good edges.
Find the UMIs for these reads.
Remove any UMI for which fewer than 50% of the kmers in the reads for the UMI are contained in good edges.
In the non-denovo case, if no strong path had a V segment annotation, remove all the UMIs.
The UMIs that survive this process are the filtered UMIs.
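The 50% rule above can be sketched with UMIs represented as a map from UMI to the k-mers of its reads, and the good edges as a k-mer set (names are illustrative):

```python
def filter_umis(umi_kmers, good_kmers):
    """Keep UMIs for which at least half of the k-mers in the
    UMI's reads are contained in good graph edges."""
    kept = set()
    for umi, kmers in umi_kmers.items():
        if not kmers:
            continue
        hits = sum(1 for k in kmers if k in good_kmers)
        if 2 * hits >= len(kmers):
            kept.add(umi)
    return kept
```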
Initially, every strong path that (a) contains an enrichment primer or (b) is annotated by a CDR3 (in the non-denovo case) is called a contig.
Some things we do (but not in the denovo case):
Contigs are trimmed to remove sequences occurring before a 5' UTR for a V segment. Contigs are also trimmed to remove sequences occurring after enrichment primers.
Delete contigs that have only a C annotation. These deleted contigs are enriched for artifacts.
If a contig has a single-base indel relative to the reference that is supported by only one UMI, or by one UMI plus one additional read, we fix it.
We remove contigs that are less than 300 bases long.
At this stage, there can be some redundancy amongst contigs, arising from actual differences in transcripts or laboratory technical artifacts or artifacts in contig construction. To eliminate this, we first compute the number of UMIs assigned to each contig. In the non-denovo case, if two productive contigs share the same junction sequence (defined as 100 bases ending at the end of a J segment), we pick one that has the most UMIs (arbitrarily in the case of a tie). In the denovo case, if two contigs are annotated with the same CDR3 sequence, we again pick the one having the most UMIs.
Nonproductive contigs are also deduplicated. Any such contig for which at least 75% of its kmers are contained in a productive contig is deleted. If 75% of the kmers in a nonproductive contig are contained in a longer nonproductive contig, we delete the shorter one. In the denovo case, the same criteria apply, with 'productive' replaced by 'has a CDR3'.
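Picking one contig per junction sequence is simple bookkeeping; a sketch, with illustrative tuple fields:

```python
def dedup_by_junction(contigs):
    """Keep the contig with the most UMIs for each junction
    sequence; ties are broken arbitrarily (first seen wins).

    `contigs` is an iterable of (junction_seq, umi_count, contig_id).
    """
    best = {}
    for junction, umis, cid in contigs:
        if junction not in best or umis > best[junction][0]:
            best[junction] = (umis, cid)
    return {cid for _, cid in best.values()}
```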
Contigs are competitively deleted with the goal of deleting contigs that arise from extracellular mRNA in the sample or other 'background' processes.
First consider the non-denovo case. For each productive contig, we define its junction sequence to be the 100 bases on the contig that end at the end of the annotated J segment. Then the junction UMI support for the contig is defined to be the number of UMIs that cover the junction sequence. We also count the reads that cover the junction sequence, the junction read support.
Suppose we have two contigs, with respective (junction UMI support, junction read support) = (u1,n1) and (u2,n2). Suppose that (u1,n1) is sufficiently larger than (u2,n2). For example, u1 ≥ 2, u2 = 1, n1 ≥ 2 * n2 would qualify. (And there are some similar criteria, not listed here.) Then if the contigs have the same chain type, we delete the second contig.
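The one example criterion quoted above can be written directly (the similar unlisted criteria are not reproduced here):

```python
def dominates(u1, n1, u2, n2):
    """Example competitive-deletion test: contig 1's junction
    support (u1 UMIs, n1 reads) is 'sufficiently larger' than
    contig 2's (u2 UMIs, n2 reads)."""
    return u1 >= 2 and u2 == 1 and n1 >= 2 * n2
```

When this holds and the two contigs have the same chain type, the second contig is deleted.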
In denovo mode, a similar criterion is applied to contigs containing a CDR3, but as junction segment we instead use the 100 bases starting from the end of the CDR3, we do not consider chain type when deleting a contig, and the two strongest contigs are protected from deletion.
Incorrect clonotypes can arise from sources such as extracellular mRNA or doublets. To protect against this, we declare that some contigs have low confidence (and subsequently, these contigs are excluded from clonotypes). We first describe how this is done in the non-denovo case. Nonproductive contigs are declared to have low confidence. Productive contigs are initially assigned high confidence, then all are downgraded to low confidence if any of the following apply:
There are more than four productive contigs
In the non-denovo case, there are more than two contigs for a chain type (e.g. three TRA contigs)
There are fewer than three filtered UMIs each having at least three read pairs
All productive contigs have junction UMI support at most one, and either there are fewer than five filtered UMIs each having at least three read pairs, or there are more than two productive contigs
Some productive contig has junction UMI support at most one, and there are fewer than three filtered UMIs each having at least n/20 read pairs, where n is the N50, taken over all UMIs in the entire dataset that have at least two read pairs, of the number of read pairs per UMI
Individual productive contigs are downgraded if their junction UMI support is at most one and the number of productive contigs exceeds two.
In the denovo case, the same criteria are applied, with 'productive contig' replaced by 'contig having a CDR3 sequence', except that the chain type test is not applied.
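The N50 used in the read-pair threshold above is the standard one; a sketch computing the n/20 cutoff over per-UMI read-pair counts (variable names and toy data are illustrative):

```python
def n50(counts):
    """N50: the largest value v such that entries >= v account
    for at least half of the total."""
    total, running = sum(counts), 0
    for v in sorted(counts, reverse=True):
        running += v
        if 2 * running >= total:
            return v
    return 0

# Per-UMI read-pair counts across the dataset, keeping UMIs with >= 2 pairs:
pairs_per_umi = [12, 8, 5, 3, 2]   # toy data
n = n50([c for c in pairs_per_umi if c >= 2])
threshold = n / 20                  # cutoff used in the confidence test
```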
Each base in the assembled contigs is assigned a Phred-scaled Quality Value (QV), representing an estimate of the probability of an error at that base. The QV is computed with a hierarchical model that accounts for errors in reverse transcription (RT), which affect all reads with the same UMI, and sequencing errors, which affect individual reads. The sequencing error model uses the reported sequencer QVs. At recommended sequencing depths, many reads per UMI are observed, so sequencing errors in individual reads are corrected rapidly. We estimate that the V(D)J RT reaction has an error rate of 1e-4 per base, so assembled bases that are covered by a single UMI will be assigned Q40, and bases covered by at least two UMIs will be assigned Q60.
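The Phred scale and the single-UMI bound follow directly from the stated RT error rate; the Q60 value for two or more UMIs is taken from the text rather than derived here:

```python
import math

def phred(p_error):
    """Phred-scaled quality value: Q = -10 * log10(p_error)."""
    return -10 * math.log10(p_error)

RT_ERROR_RATE = 1e-4               # per-base RT error rate from the text
q_one_umi = phred(RT_ERROR_RATE)   # ~40: RT errors dominate with one UMI
Q_TWO_UMIS = 60                    # value assigned for >= 2 UMIs, per the text
```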