
# Gene Expression Algorithms Overview

## Alignment

A full-length cDNA construct is flanked by the 30 bp template switch oligo (TSO) sequence, AAGCAGTGGTATCAACGCAGAGTACATGGG, on the 5' end and poly-A on the 3' end. Some fraction of sequencing reads is expected to contain either or both of these sequences, depending on the fragment size distribution of the sequencing library. Reads derived from short RNA molecules are more likely to contain TSO sequence, poly-A sequence, or both than reads derived from longer RNA molecules.

Because non-template sequence, in the form of TSO or low-complexity poly-A ends, confounds read mapping, TSO sequence is trimmed from the 5' end of read 2 and poly-A is trimmed from the 3' end prior to alignment. Trimming improves the sensitivity of the assay as well as the computational efficiency of the software pipeline.

Tags ts:i and pa:i in the output BAM files indicate the number of TSO nucleotides trimmed from the 5' end of read 2 and the number of poly-A nucleotides trimmed from the 3' end. The trimmed bases are present in the sequence of the BAM record and are soft clipped in the CIGAR string.
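The trimming step can be illustrated with a minimal sketch. It assumes an exact, full-length TSO match at the 5' end and a plain run of `A` bases at the 3' end; the function name `trim_read` and the `min_poly_a` threshold are hypothetical and are not taken from Space Ranger, whose actual matching rules are not described here.

```python
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"  # 30 bp TSO sequence quoted above


def trim_read(seq, tso=TSO, min_poly_a=10):
    """Return (trimmed_seq, ts, pa) for a read 2 sequence.

    ts and pa mimic the ts:i and pa:i tags: the number of TSO bases
    removed from the 5' end and poly-A bases removed from the 3' end.
    min_poly_a is an illustrative threshold (assumption), used so that
    one or two trailing A bases are not treated as a poly-A tail.
    """
    # Simplification: only an exact, complete TSO prefix is trimmed.
    ts = len(tso) if seq.startswith(tso) else 0
    body = seq[ts:]

    # Length of the trailing run of A bases.
    run = len(body) - len(body.rstrip("A"))
    pa = run if run >= min_poly_a else 0

    trimmed = body[: len(body) - pa]
    return trimmed, ts, pa


read = TSO + "CGTACGTTAGC" + "A" * 15
trimmed, ts, pa = trim_read(read)
```

In the real BAM output the trimmed bases stay in the record and are soft clipped, so `ts` and `pa` would correspond to soft-clip lengths rather than to removed sequence.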

### Genome Alignment

Space Ranger uses an aligner called STAR, which performs splicing-aware alignment of reads to the genome. After alignment, Space Ranger uses the transcript annotation GTF file to classify each read as exonic, intronic, or intergenic, and to determine whether the read aligns confidently to the genome. Space Ranger counts a read as exonic if at least 50% of it intersects an exon. If a read is not exonic and intersects an intron, it is counted as intronic; otherwise, the read is counted as intergenic.
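The classification rule can be sketched as follows. This is a simplified illustration, not Space Ranger's code: it treats a read as a single contiguous half-open interval `(start, end)`, ignores spliced alignments and strand, and assumes the exon intervals passed in do not overlap each other. The helper names are hypothetical.

```python
def overlap(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_read(read, exons, introns):
    """Classify a read interval as 'exonic', 'intronic', or 'intergenic'.

    exons and introns are lists of (start, end) intervals; exons are
    assumed to be non-overlapping so that overlaps can be summed.
    """
    read_len = read[1] - read[0]
    exonic_bases = sum(overlap(read, e) for e in exons)

    # Exonic if at least 50% of the read intersects exon sequence.
    if 2 * exonic_bases >= read_len:
        return "exonic"
    # Otherwise intronic if it touches any intron.
    if any(overlap(read, i) > 0 for i in introns):
        return "intronic"
    return "intergenic"


exons = [(100, 200), (300, 400)]
introns = [(200, 300)]
labels = [classify_read(r, exons, introns)
          for r in [(150, 250), (180, 280), (500, 600)]]
```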

For reads that align to a single exonic locus but also align to one or more non-exonic loci, the exonic locus is prioritized and the read is considered to be confidently mapped to the exonic locus with MAPQ 255.

### Transcriptome Alignment

Space Ranger further aligns exonic reads to annotated transcripts, looking for compatibility. A read that is compatible with the exons of an annotated transcript, and aligned to the same strand, is considered mapped to the transcriptome. If the read is compatible with a single gene annotation, it is considered uniquely (confidently) mapped to the transcriptome. These confidently mapped reads are the only ones considered for UMI counting.

### UMI Counting

Before counting UMIs, Space Ranger attempts to correct for sequencing errors in the UMI sequences. Reads that were confidently mapped to the transcriptome are placed into groups that share the same barcode, UMI, and gene annotation. If two groups of reads have the same barcode and gene, but their UMIs differ by a single base (i.e., are Hamming distance 1 apart), then one of the UMIs was likely introduced by a substitution error in sequencing. In this case, the UMI of the less-supported read group is corrected to the UMI with higher support.
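A minimal sketch of this correction step is shown below. It assumes read support is available as a dictionary keyed by `(barcode, umi, gene)`; the function names are hypothetical. How ties in support are handled is not stated above, so this sketch leaves a UMI uncorrected unless a neighbor has strictly more reads (an assumption).

```python
from collections import Counter


def hamming1(a, b):
    """True if two equal-length strings differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct_umis(read_counts):
    """Reassign each UMI to its best-supported Hamming-distance-1 neighbor.

    read_counts maps (barcode, umi, gene) -> number of supporting reads.
    Only groups with the same barcode and gene are compared. The
    quadratic scan is for clarity, not speed.
    """
    corrected = Counter()
    for (bc, umi, gene), n in read_counts.items():
        best_umi, best_n = umi, n
        for (bc2, umi2, gene2), n2 in read_counts.items():
            if (bc2 == bc and gene2 == gene
                    and hamming1(umi, umi2) and n2 > best_n):
                best_umi, best_n = umi2, n2
        corrected[(bc, best_umi, gene)] += n
    return dict(corrected)


counts = {
    ("BC1", "AAAA", "G1"): 10,
    ("BC1", "AAAT", "G1"): 2,   # one mismatch from AAAA, less support
    ("BC1", "CCCC", "G1"): 3,
    ("BC2", "AAAT", "G1"): 1,   # different barcode, left untouched
}
corrected = correct_umis(counts)
```

Here the two reads with UMI `AAAT` in barcode `BC1` are folded into `AAAA`, while the same UMI under `BC2` is left alone because correction never crosses barcodes.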

Space Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.
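The gene-assignment rule, including the tie case, can be sketched like this. The input layout (a dictionary keyed by `(barcode, umi, gene)`) and the name `resolve_genes` are assumptions made for illustration.

```python
from collections import defaultdict


def resolve_genes(read_counts):
    """Keep one gene per (barcode, UMI), chosen by read support.

    read_counts maps (barcode, umi, gene) -> number of supporting reads.
    If several genes tie for the most reads, every group for that
    (barcode, UMI) is discarded.
    """
    by_molecule = defaultdict(dict)
    for (bc, umi, gene), n in read_counts.items():
        by_molecule[(bc, umi)][gene] = n

    kept = {}
    for (bc, umi), genes in by_molecule.items():
        top = max(genes.values())
        winners = [g for g, n in genes.items() if n == top]
        if len(winners) == 1:  # a unique best-supported gene
            kept[(bc, umi, winners[0])] = top
    return kept


counts = {
    ("BC1", "AAAA", "G1"): 5,
    ("BC1", "AAAA", "G2"): 2,   # fewer reads than G1, discarded
    ("BC1", "CCCC", "G1"): 3,
    ("BC1", "CCCC", "G2"): 3,   # tie, both groups discarded
    ("BC2", "GGGG", "G3"): 1,
}
kept = resolve_genes(counts)
```

Each key remaining in `kept` would contribute one UMI count to the unfiltered matrix, with the stored value being its read support.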

After these two filtering steps, each observed barcode, UMI, gene combination is recorded as a UMI count in the unfiltered feature-barcode matrix. The number of reads supporting each counted UMI is also recorded in the molecule info file.

## Detecting Tissue Barcodes

Space Ranger detects spots under the tissue section in the Imaging subpipeline. Only the barcodes associated with these under-tissue spots are used for downstream analyses.

## Secondary Analysis of Gene Expression

### Dimensionality Reduction

To reduce the gene expression matrix to its most important features, Space Ranger uses Principal Components Analysis (PCA) to reduce the dimensionality of the dataset from (spots x genes) to (spots x M), where M is 10. The pipeline uses a Python implementation of the IRLBA algorithm (Baglama & Reichel, 2005), modified to reduce memory consumption.

#### t-SNE

To visualize data in 2-D space, Space Ranger passes the PCA-reduced data into t-distributed Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction method (Van der Maaten, 2014). The C++ reference implementation by Van der Maaten was modified to take a PRNG seed for determinism. The runtime is also decreased by fixing the number of output dimensions at compile time to 2 or 3.

#### UMAP

Space Ranger also supports Uniform Manifold Approximation and Projection (UMAP), which estimates a topology of the high-dimensional data and uses this information to estimate a low-dimensional embedding that preserves relationships between data points (McInnes & Healy, 2018). The pipeline uses the Python implementation of this algorithm by Leland McInnes. UMAP coordinates are available in the pipeline output but are not displayed in the web summary.

### Clustering

Space Ranger uses two different methods for clustering spots by expression similarity, both of which operate in the PCA representation.

#### Graph-based

The graph-based clustering algorithm builds a sparse nearest-neighbor graph (where spots are linked if they are among the k nearest Euclidean neighbors of one another), then applies Louvain Modularity Optimization (LMO; Blondel, Guillaume, Lambiotte, & Lefebvre, 2008), an algorithm that seeks highly connected "modules" in the graph. The value of k, the number of nearest neighbors, is set to scale logarithmically with the number of spots. An additional cluster-merging step follows: Space Ranger performs hierarchical clustering on the cluster medoids in PCA space and merges pairs of sibling clusters if no genes are differentially expressed between them (with B-H adjusted p-value below 0.05). The hierarchical clustering and merging are repeated until no cluster pairs remain to merge.
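The graph construction step can be sketched in plain Python. This is only an illustration under two assumptions: "neighbors of one another" is read as a mutual relationship (both spots list each other among their k nearest), and k is passed in directly because the exact logarithmic formula is not given here. The brute-force search and function names are hypothetical; the Louvain and merging steps are not shown.

```python
import math


def knn(points, k):
    """For each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort other points by Euclidean distance, breaking ties by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_edges(points, k):
    """Undirected edges (i, j), i < j, where i and j are mutual k-NN."""
    nn = knn(points, k)
    return {
        (i, j)
        for i in range(len(points))
        for j in nn[i]
        if i < j and i in nn[j]
    }


# Points stand in for spots in PCA space (two coordinates for brevity).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
edges = mutual_knn_edges(points, k=1)
```

With `k=1`, the two well-separated groups of points yield no edges between them, which is the kind of sparse structure a modularity-based method can then partition.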

The use of LMO to cluster spots was inspired by a similar method in the R package Seurat.

#### K-Means

Space Ranger also performs traditional K-means clustering across a range of K values, where K is the preset number of clusters.

### Differential Expression

In order to identify genes whose expression is specific to each cluster, Space Ranger tests, for each gene and each cluster, whether the in-cluster mean differs from the out-of-cluster mean.

In order to find differentially expressed genes between groups of spots, Space Ranger uses the quick and simple method sSeq (Yu, Huber, & Vitek, 2013), which employs a negative binomial exact test. When the counts become large, Space Ranger switches to the fast asymptotic beta test used in edgeR (Robinson & Smyth, 2007). For each cluster, the algorithm is run on that cluster versus all other spots, yielding a list of genes that are differentially expressed in that cluster relative to the rest of the sample.

Space Ranger's implementation differs slightly from that in the sSeq paper, whose authors recommend using DESeq's geometric mean-based definition of library size (Love, Huber & Anders, 2014). Space Ranger instead computes relative library size as the total UMI counts for each spot divided by the median UMI counts per spot. As with sSeq, normalization is implicit in that the per-spot library-size parameter is incorporated as a factor in the exact-test probability calculations.
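The relative library size described above amounts to a short calculation. The sketch below assumes the count matrix is a list of per-spot lists of UMI counts; the function name is hypothetical, and the zero-median guard is an addition for illustration.

```python
from statistics import median


def relative_library_sizes(matrix):
    """Total UMI count per spot divided by the median total across spots.

    matrix is a (spots x genes) list of lists of UMI counts.
    """
    totals = [sum(spot) for spot in matrix]
    med = median(totals)
    if med == 0:
        raise ValueError("median UMI count per spot is zero")
    return [t / med for t in totals]


matrix = [
    [1, 1],  # spot 0: 2 UMIs
    [2, 2],  # spot 1: 4 UMIs (the median)
    [4, 4],  # spot 2: 8 UMIs
]
sizes = relative_library_sizes(matrix)  # [0.5, 1.0, 2.0]
```

A spot with the median UMI total gets a size factor of 1, so the factors express each spot's sequencing depth relative to a typical spot.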

## References

Baglama, J. & Reichel, L. Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM Journal on Scientific Computing 27, 19–42 (2005).

Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, (2008).

Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).

McInnes, L. & Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv (2018).

Robinson, M. D. & Smyth, G. K. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332 (2007).

Van der Maaten, L. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15, 3221–3245 (2014).

Yu, D., Huber, W. & Vitek, O. Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29, 1275–1282 (2013).