10x Genomics
Chromium Single Cell Immune Profiling

Assembly Algorithm

The assembly process operates independently on each cell barcode. The output for each cell barcode is a set of assembled contigs that represent the best estimate of transcript sequences present, along with per-base quality value estimates, and the number of UMIs and reads supporting each contig. The assembly algorithm proceeds through the following steps:

Read Trimming

Trim known adapter and primer sequences from the 5’ and 3’ ends of reads using the cutadapt tool. This tool uses Smith-Waterman alignment and allows for a small number of differences from the expected primer sequences.

Read Filtering

The FILTER_VDJ_READS stage compares reads against all the V(D)J gene segments included in the reference. Read pairs that contain at least one exact 20 bp match to a reference segment are included in the set of reads to be assembled. These matches are not full alignments and are used only to filter reads prior to assembly.
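
The exact-match filter described above can be sketched as a k-mer set lookup. This is a minimal illustration, not the pipeline's implementation: the function names and the tuple layout of `read_pairs` are hypothetical, and the sketch assumes reads and reference segments are on the same strand (it does not check reverse complements).

```python
def build_kmer_index(reference_segments, k=20):
    """Collect every k-mer that occurs in any reference segment."""
    index = set()
    for seq in reference_segments:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index.add(seq[i:i + k])
    return index


def has_exact_match(read, index, k=20):
    """Return True if any k-mer of the read is present in the index."""
    read = read.upper()
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


def filter_read_pairs(read_pairs, index, k=20):
    """Keep pairs in which at least one mate shares a k-mer with the reference.

    Assumption: a pair passes when either mate has a match.
    """
    return [
        (r1, r2)
        for r1, r2 in read_pairs
        if has_exact_match(r1, index, k) or has_exact_match(r2, index, k)
    ]
```

A set lookup of this kind tests only for shared 20-mers, which is why such matches are cheaper than, and not a substitute for, full alignments.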

UMI Filtering

The FILTER_UMIS stage analyzes the distribution of reads per UMI to determine a threshold that separates 'real' UMIs, which likely result from correct cDNA production and enrichment, from 'bad' UMIs, which may result from error processes in the assay.

Assembly

The ASSEMBLE_VDJ stage performs de novo assembly of reads from each cell barcode independently. The assembler will downsample input reads to an average of 400 reads/UMI to avoid artifacts caused by extremely high coverage. The assembler uses only reads that come from UMIs passing the FILTER_UMIS reads-per-UMI threshold and that were detected as V(D)J reads by the FILTER_VDJ_READS stage.
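
One way to picture the downsampling step is uniform random subsampling across all reads of a barcode until the mean is 400 reads/UMI. The sketch below assumes exactly that; how the assembler actually selects reads is not stated here, and the `reads_by_umi` layout, function name, and `seed` parameter are hypothetical.

```python
import random


def downsample_reads(reads_by_umi, target_mean=400, seed=0):
    """Subsample reads so that the mean number of reads per UMI is target_mean.

    reads_by_umi: dict mapping UMI -> list of reads (hypothetical layout).
    Barcodes already at or below the target are returned unchanged.
    """
    n_umis = len(reads_by_umi)
    total = sum(len(reads) for reads in reads_by_umi.values())
    if n_umis == 0 or total <= target_mean * n_umis:
        return {umi: list(reads) for umi, reads in reads_by_umi.items()}

    # Assumption: reads are drawn uniformly at random without replacement,
    # so deeply sequenced UMIs keep proportionally more reads.
    pairs = [(umi, read) for umi, reads in reads_by_umi.items() for read in reads]
    rng = random.Random(seed)
    kept = rng.sample(pairs, target_mean * n_umis)

    result = {umi: [] for umi in reads_by_umi}
    for umi, read in kept:
        result[umi].append(read)
    return result
```

Because the target is an average, individual UMIs may still retain more or fewer than 400 reads after subsampling.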

The assembly algorithm is outlined here:

  1. Build a De Bruijn graph (K=47) of the sequences in the set of accepted read pairs. A k-mer must occur in at least 2 reads to be included in the graph. Track the number of reads and UMIs that contain each k-mer in the graph.
  2. Sort graph nodes by UMI support.
  3. Get the best-supported node, and find "high quality" paths from that node, extending in both directions. "High quality" means that branched paths are followed only if the branching base has at least some minimum quality.
  4. Invalidate all nodes on or near the returned paths.
  5. If all nodes have been invalidated, continue to step 6. Otherwise, find the next best-supported (and not invalidated) node and continue at step 3.
  6. Sort the paths found in Step 3 by read support.
  7. Assign UMIs to paths: start from the strongest path and assign each UMI to that path if the total score of its reads aligned against that path is within score_factor of the total score of its reads aligned against the best path for that UMI.
  8. Remove paths that have no UMIs assigned in Step 7.
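
Steps 1 and 2 of the outline above can be sketched as follows. This is an illustrative simplification, not the assembler's code: it treats each k-mer as a graph node linked to k-mers that overlap it by K-1 bases, the `(read_id, umi, sequence)` input layout and all function names are hypothetical, and breaking ties in UMI support by read support is an assumption.

```python
from collections import defaultdict


def count_kmers(reads, k=47, min_reads=2):
    """Return {kmer: (n_reads, n_umis)} for k-mers seen in >= min_reads reads.

    reads: iterable of (read_id, umi, sequence) tuples (hypothetical layout).
    """
    reads_with = defaultdict(set)
    umis_with = defaultdict(set)
    for read_id, umi, seq in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            reads_with[kmer].add(read_id)
            umis_with[kmer].add(umi)
    return {
        kmer: (len(reads_with[kmer]), len(umis_with[kmer]))
        for kmer in reads_with
        if len(reads_with[kmer]) >= min_reads
    }


def build_edges(kmers):
    """Link each retained k-mer to retained k-mers that extend it by one base."""
    edges = {}
    for kmer in kmers:
        successors = [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]
        if successors:
            edges[kmer] = successors
    return edges


def nodes_by_umi_support(kmers):
    """Sort nodes by UMI support, then read support (tie-break is an assumption)."""
    return sorted(kmers, key=lambda km: (-kmers[km][1], -kmers[km][0], km))
```

The first element of `nodes_by_umi_support(...)` would be the seed node for Step 3; path finding and node invalidation (Steps 3 to 5) are not shown.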

Each of the supported paths through the De Bruijn graph that remain after Step 8 is emitted as a contig sequence from the assembler. Input read pairs that contributed to each contig are mapped to the final contig sequence and emitted in the all_contig.bam output file.
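
The UMI assignment and pruning in Steps 7 and 8 can be illustrated with a short sketch. It rests on several assumptions that the outline does not settle: "within score_factor" is read as a multiplicative tolerance (a path qualifies if its score is at least score_factor times the UMI's best score), scores are positive, and the default value of 0.9 is a placeholder rather than the pipeline's setting. The function names and input layout are hypothetical.

```python
def assign_umis(path_order, scores, score_factor=0.9):
    """Assign each UMI to the strongest path that scores close to its best path.

    path_order: path ids sorted strongest first (the order from Step 6).
    scores: dict umi -> dict path_id -> total score of that UMI's reads
            aligned against the path (hypothetical layout, positive scores).
    score_factor: assumed multiplicative tolerance; 0.9 is a placeholder.
    """
    assignment = {}
    for umi, per_path in scores.items():
        if not per_path:
            continue
        best = max(per_path.values())
        for path in path_order:
            if per_path.get(path, 0.0) >= score_factor * best:
                assignment[umi] = path
                break
    return assignment


def supported_paths(path_order, assignment):
    """Drop paths that received no UMIs (Step 8), keeping the original order."""
    used = set(assignment.values())
    return [path for path in path_order if path in used]
```

Walking the paths from strongest to weakest means a UMI whose reads fit two paths almost equally well goes to the better-supported one, which concentrates UMIs on fewer contigs.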

Assembly Quality Values

Each base in the assembled contigs is assigned a Phred-scaled Quality Value (QV), representing an estimate of the probability of an error at that base. The QV is assigned with a hierarchical model that accounts for errors in reverse transcription (RT), which affect all reads with the same UMI, and sequencing errors, which affect individual reads. The sequencing error model uses the reported sequencer QVs. At recommended sequencing depths, many reads per UMI are observed, so sequencing errors in individual reads are corrected rapidly. We estimate that the V(D)J RT reaction has an error rate of 1e-4 per base, so assembled bases that are covered by a single UMI will be assigned Q40, and bases covered by >= 2 UMIs will be assigned Q60.
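
The Q40 figure follows from the standard Phred definition, Q = -10 * log10(p), applied to the 1e-4 RT error rate. The sketch below shows that conversion and encodes the per-UMI rule exactly as stated in the text; it deliberately ignores the sequencing-error part of the hierarchical model, and the function names are hypothetical.

```python
import math

RT_ERROR_RATE = 1e-4  # per-base RT error rate quoted in the text


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10.0 * math.log10(p_error)


def rt_limited_qv(n_umis):
    """QV implied by the rule in the text when sequencing errors are negligible.

    One supporting UMI gives Q40 (the RT error rate); two or more give Q60,
    the value stated in the text.
    """
    if n_umis < 1:
        raise ValueError("a base needs at least one supporting UMI")
    return 40 if n_umis == 1 else 60
```

Here `phred(RT_ERROR_RATE)` evaluates to 40 up to floating-point rounding, matching the single-UMI case.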