Cell Ranger 1.3, printed on 11/18/2019
Cell Ranger uses an aligner called STAR, which aligns reads simultaneously to the genome and transcriptome. Cell Ranger uses both types of alignments to determine whether a read could be confidently associated with a transcript and/or gene. See the BAM documentation for details on the definition of confident mapping.
In order to make analysis of the expression data more tractable, Cell Ranger uses Principal Components Analysis (PCA) to reduce the dimensionality of the dataset from (cells x genes) to (cells x M) where M is a user-selectable number of principal components. The pipeline uses the IRLBA algorithm, (Baglama & Reichel, 2005) modified to reduce memory consumption. The reanalyze pipeline allows the user to further reduce the data by randomly subsampling the cells and/or selecting genes by their dispersion across the dataset.
Cell Ranger uses two different methods for clustering cells by expression similarity, both of which operate in the PCA space.
The graph-based clustering algorithm consists of building a sparse nearest-neighbor graph (where cells are linked if they among the k nearest Euclidean neighbors of one another), followed by Louvain Modularity Optimization, (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) an algorithm which seeks to find highly-connected "modules" in the graph. The value of k, the number of nearest neighbors, is set to scale logarithmically with the number of cells. An additional cluster-merging step is done: Perform hierarchical clustering on the cluster-medoids in PCA space and merge pairs of sibling clusters if there are no genes differentially expressed between them (with B-H adjusted p-value below 0.05). The hierarchical clustering and merging is repeated until there are no more cluster-pairs to merge.
Cell Ranger also performs traditional K-means clustering across a range of K values, where K is the preset number of clusters. In the web summary prior to 1.3.0, the default selected value of K is that which yields the best Davies-Bouldin Index, a rough measure of clustering quality.
In order to find differentially expressed genes between groups of cells, Cell Ranger uses the quick and simple method sSeq, (Yu, Huber, & Vitek, 2013) which employs a negative binomial exact test. When the counts become large, Cell Ranger switches to the fast asymptotic beta test used in edgeR. (Robinson & Smyth, 2007) For each cluster, the algorithm is run on that cluster versus all other cells, yielding a list of genes that are differentially expressed in that cluster relative to the rest of the sample.
Baglama, J. & Reichel, L. Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM Journal on Scientific Computing 27, 19–42 (2005).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, (2008).
Robinson, M. D. & Smyth, G. K. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332 (2007). Link to edgeR source.
Van der Maaten, L.J.P. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15, 3221-3245 (2014).
Yu, D., Huber, W. & Vitek, O. Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29, 1275–1282 (2013).