HOME  ›   pipelines

# Assembly Algorithm

The assembly process operates independently on each cell barcode. The output for each cell barcode is a set of assembled contigs that represent the best estimate of transcript sequences present, along with per-base quality value estimates, and the number of UMIs and reads supporting each contig. The assembly algorithm proceeds through the following steps:

Trim known adapter and primer sequences from the 5′ and 3′ ends of reads using the cutadapt tool. This tool uses Smith-Waterman alignment and allows for a small number of differences from the expected primer sequences.

The FILTER_VDJ_READS stage approximately aligns reads to all the V(D)J gene segments included in the reference. Read-pairs that exceed a specified alignment score and include at least one 15bp exact match against at least one of the reference segments are included in the set of reads to be assembled. These mappings are not full alignments and are only used for filtering reads before assembly.

## Assembly

The ASSEMBLE_VDJ stage performs de novo assembly of reads from each cell barcode independently. The assembler will use at most 100k read-pairs per cell barcode to avoid artifacts caused by extremely high coverage. The assembler uses only those reads that are detected as V(D)J reads by the FILTER_VDJ_READS stage and whose UMIs have a minimum number of reads. This threshold is computed for each run based on the distribution of read-pairs per UMI (see UMI Filtering section). NOTE: Cell Ranger 2.1 and later includes a --denovo option that instructs the assembler to ignore the results of the filtering step and attempt to assemble all reads. This option can be useful when working with poorly annotated species.

The assembly algorithm is outlined here:

1. Build a De Bruijn graph (using a kmer length, K=32) of the sequences in the set of accepted read pairs, keeping track of the number of reads and UMIs associated with each kmer in the graph. A kmer must be included in two or more reads to be incorporated in the graph.
2. Sort the nodes in the graph in the decreasing order of "read support" (defined as the number of reads associated with the node).
3. Select the best-supported node as the root node.
4. Find "high quality" paths from the root node, extending in both directions. "High quality" means that the branched paths are followed only if the branching base has at least some minimum quality.
5. Mark all the nodes on the returned paths as visited.
6. If all the nodes have been visited, continue to step 7. Otherwise, choose the next best-supported node which is not visited, as the root node and return to step 4. (Note: Only the nodes which account for 99% of the reads are considered as root nodes to reduce the number of extraneous path explorations)
7. For each path, collect the set of UMIs linked to the nodes in the path. Compute a mapping score for all such UMIs based on how well the reads corresponding to the UMI maps to the contig implied by the path. Calculate the total score of the path as the sum of assigned scores across all UMIs. Additionally, keep track of the best mapping score for all UMIs.
8. Sort the paths in the decreasing order of total scores.
9. For each UMI, assign the first path (in the order described in Step 8) with a mapping score within a certain fraction (60%) of the best mapping score for that UMI.
10. Remove the paths that have no UMIs assigned in Step 9.

The assembler outputs the contig sequences associated with paths which are assigned at least one UMI with the mapping of the input read pairs that contributed to each contig in the all_contig.bam output file. For more details, please refer to the source code (https://github.com/10XGenomics/cellranger/blob/master/lib/rust/vdj_asm/src/asm.rs).

## Assembly Quality Values

Each base in the assembled contigs is assigned a Phred-scaled Quality Value (QV), representing an estimate of the probability of an error at that base. The QV is computed with a hierarchical model that accounts for the errors in reverse transcription (RT), that will affect all reads with the same UMI, and sequencing errors which affect individual reads. The sequencing error model uses the reported sequencer QVs. At recommended sequencing depths, many reads per UMI are observed, so sequencing errors in individual reads are corrected rapidly. We estimate that the V(D)J RT reaction has an error rate of 1e-4 per base, so assembled bases that are covered by a single UMI will be assigned Q40, and bases covered by at least two UMIs will be assigned Q60.