Space Ranger1.1, printed on 11/17/2024
Space ranger uses an aligner called STAR that performs splicing-aware alignment of reads to the genome. Before invoking STAR, Space Ranger trims any residual template switch oligo (TSO) from the 5' end and any poly-A sequence from the 3' end of read 2. Trimming these non-template sequences improves mapping.
After alignment, Space Ranger uses the transcript annotation GTF file to count each read as either exonic, intronic, or intergenic and also whether reads align confidently to the genome. Space Ranger counts a read as exonic if at least 50% of it intersects an exon. If a read is not exonic and intersects an intron, it is counted as intronic, otherwise the read is counted as intergenic.
For reads that align to a single exonic locus, but also align to 1 or more non-exonic loci, the exonic locus is prioritized and the read is considered to be confidently mapped to the exonic locus with MAPQ 255.
Space Ranger further aligns exonic reads to annotated transcripts, looking for compatibility. A read that is compatible with the exons of an annotated transcript, and aligned to the same strand, is considered mapped to the transcriptome. If the read is compatible with a single gene annotation, it is considered uniquely (confidently) mapped to the transcriptome. These confidently mapped reads are the only ones considered for UMI counting.
Before counting UMIs, Space Ranger attempts to correct for sequencing errors in the UMI sequences. Reads that were confidently mapped to the transcriptome are placed into groups that share the same barcode, UMI, and gene annotation. If two groups of reads have the same barcode and gene, but their UMIs differ by a single base, that is are Hamming distance 1 apart, then one of the UMIs was likely introduced by a substitution error in sequencing. In this case, the UMI of the less-supported read group is corrected to the UMI with higher support.
Space Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.
After these two filtering steps, each observed barcode, UMI, and gene combination is recorded as a UMI count in the unfiltered feature-barcode matrix. The number of reads supporting each counted UMI is also recorded in the molecule info file.
Space Ranger detects spots under the tissue section in the Imaging subpipeline. Only the barcodes associated to these under tissue spots are used for downstream analyses.
In order to reduce the gene expression matrix to its most important features, Space Ranger uses Principal Components Analysis (PCA) to change the dimensionality of the dataset from (spots x genes) to (spots x M) where M is 10. The pipeline uses a python implementation of IRLBA algorithm, (Baglama & Reichel, 2005), which is modified to reduce memory consumption.
For visualizing data in 2-d space, Space Ranger passes the PCA-reduced data into t-Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction method (Van der Maaten, 2014). The C++ reference implementation by Van der Maaten was modified to take a PRNG seed for determinism. The runtime is also decreased by fixing the number of output dimensions at compile time to 2 or 3.
Space Ranger also supports Uniform Manifold Approximation and Projection (UMAP), which estimates a topology of the high dimensional data and uses this information to estimate a low dimensional embedding that preserves relationships between datapoints (McInnes & Healy, 2018). The pipeline uses the python implementation of this algorithm by Leland McInnes. UMAP coordinates are available in the pipeline output, but not displayed in the web summary.
Space Ranger uses two different methods for clustering spots by expression similarity, both of which operate in the PCA representation.
The graph-based clustering algorithm consists of building a sparse nearest-neighbor graph (where spots are linked if they are among the k nearest Euclidean neighbors of one another), followed by Louvain Modularity Optimization (LMO; Blondel, Guillaume, Lambiotte, & Lefebvre, 2008), an algorithm which seeks to find highly-connected "modules" in the graph. The value of k, the number of nearest neighbors, is set to scale logarithmically with the number of spots. An additional cluster-merging step is done: Perform hierarchical clustering on the cluster-medoids in PCA space and merge pairs of sibling clusters if there are no genes differentially expressed between them (with B-H adjusted p-value below 0.05). The hierarchical clustering and merging is repeated until there are no more cluster-pairs to merge.
The use of LMO to cluster spots was inspired by a similar method in the R package Seurat.
Space Ranger also performs traditional K-means clustering across a range of K values, where K is the preset number of clusters.
In order to identify genes with expression specific to each cluster, Space Ranger tests (each gene and each cluster) for whether the in-cluster mean differs from the out-of-cluster mean.
This difference is tested using the quick and simple method sSeq (Yu, Huber, & Vitek, 2013), which employs a negative binomial exact test. When the counts become large, Space Ranger switches to the fast asymptotic beta test used in edgeR (Robinson & Smyth, 2007). For each cluster, the algorithm is run on that cluster versus all other spots, yielding a list of genes that are differentially expressed in that cluster relative to the rest of the sample.
Space Ranger's implementation differs slightly from that in the paper. In the sSeq paper, the authors recommend using DESeq's geometric mean-based definition of library size (Love, Huber & Anders, 2014). Space Ranger instead computes relative library size as the total UMI counts for each cell divided by the median UMI counts per cell. As with sSeq, normalization is implicit in that the per-cell library-size parameter is incorporated as a factor in the exact-test probability calculations.
Baglama, J. & Reichel, L., Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM Journal on Scientific Computing 27, 19–42 (2005).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, (2008).
Love, M. L., Huber, W. & Anders, S., Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, number 550 (2014).
McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv (2018).
Robinson, M. D. & Smyth, G. K. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332 (2007).
Van der Maaten, L., Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15, 3221-3245 (2014).
Yu, D., Huber, W. & Vitek, O., Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29, 1275–1282 (2013).