Cell Ranger 1.2, printed on 05/30/2020
The count, aggr and reanalyze pipelines output several CSV files which contain automated secondary analysis results. A subset of these results is used to render the Analysis View in the run summary.
Before clustering the cells, Principal Component Analysis (PCA) is run on the normalized filtered gene-barcode matrix to reduce the number of feature (gene) dimensions. This produces a projection of each cell onto the first N principal components. By default N=10; when running reanalyze, you can choose to increase it.
$ cd /home/jdoe/runs/sample345/outs
$ head -2 analysis/pca/10_components/projection.csv
Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10
AAACATACAACGAA-1,-0.2765,-5.7056,6.5324,-12.2736,-1.4390,-1.1656,-0.1754,-2.9748,3.3785,1.6539
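As an illustration of how this file might be consumed downstream, the following minimal Python sketch parses a projection CSV into a dictionary keyed by barcode. The function name load_projection is hypothetical and not part of cellranger; the sample text is copied from the excerpt above, and a real run would read the file from disk instead.

```python
import csv
import io

# First two lines of projection.csv, copied from the excerpt above.
# For a real run, pass an open file handle for
# analysis/pca/10_components/projection.csv instead.
SAMPLE_PROJECTION = (
    "Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10\n"
    "AAACATACAACGAA-1,-0.2765,-5.7056,6.5324,-12.2736,-1.4390,"
    "-1.1656,-0.1754,-2.9748,3.3785,1.6539\n"
)


def load_projection(handle):
    """Hypothetical helper: return (component names, {barcode: [coordinates]})."""
    reader = csv.reader(handle)
    header = next(reader)  # "Barcode" followed by one column per component
    coords = {row[0]: [float(v) for v in row[1:]] for row in reader}
    return header[1:], coords


pc_names, coords = load_projection(io.StringIO(SAMPLE_PROJECTION))
print(pc_names[0], coords["AAACATACAACGAA-1"][0])
```

The same helper should work for any N, since the number of components is taken from the header rather than hard-coded.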
PCA also produces a components matrix, which indicates how much each gene contributed to each principal component.
$ head -2 analysis/pca/10_components/components.csv
PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,...,-0.0104
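One possible use of the components matrix is to list the genes with the largest absolute loading on a given component. The sketch below assumes the layout shown in the excerpt (one row per PC, one column per gene). The helper top_loading_genes is hypothetical, and the sample keeps only the four gene columns printed in full; the columns elided by "..." are left out rather than guessed.

```python
import csv
import io

# Truncated copy of the components.csv excerpt: only the four gene columns
# printed in full are kept; the columns elided by "..." are omitted.
SAMPLE_COMPONENTS = (
    "PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880\n"
    "1,-0.0044,0.0039,-0.0024,-0.0016\n"
)


def top_loading_genes(handle, pc, n):
    """Hypothetical helper: the n genes with the largest |loading| on component pc."""
    reader = csv.reader(handle)
    genes = next(reader)[1:]
    for row in reader:
        if int(row[0]) == pc:
            loadings = [float(v) for v in row[1:]]
            ranked = sorted(zip(genes, loadings),
                            key=lambda pair: abs(pair[1]), reverse=True)
            return ranked[:n]
    raise KeyError("component %d not found" % pc)


top = top_loading_genes(io.StringIO(SAMPLE_COMPONENTS), pc=1, n=2)
for gene, loading in top:
    print(gene, loading)
```

Ranking by absolute value is a choice made here so that strongly negative loadings are not overlooked.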
PCA also produces the proportion of total variance explained by each principal component. When choosing the number of significant principal components, it is useful to plot the variance explained as a function of PC rank: once the values start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.
$ head -5 analysis/pca/10_components/variance.csv
PC,Proportion.Variance.Explained
1,0.0056404970744118104
2,0.0038897311237809061
3,0.0028803714818085419
4,0.0020830581822081206
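To make the "flattening" idea concrete, the sketch below computes the cumulative variance explained and the drop between successive components from variance.csv. It is an illustrative aid for inspecting the curve, not a rule that cellranger itself applies; the helper name variance_profile is hypothetical and the sample rows are those shown above.

```python
import csv
import io
from itertools import accumulate

# Rows copied from the variance.csv excerpt above.
SAMPLE_VARIANCE = (
    "PC,Proportion.Variance.Explained\n"
    "1,0.0056404970744118104\n"
    "2,0.0038897311237809061\n"
    "3,0.0028803714818085419\n"
    "4,0.0020830581822081206\n"
)


def variance_profile(handle):
    """Hypothetical helper: per-PC proportions, running total, successive drops."""
    rows = list(csv.DictReader(handle))
    proportions = [float(r["Proportion.Variance.Explained"]) for r in rows]
    cumulative = list(accumulate(proportions))
    # drops[i] is how much less variance PC i+2 explains than PC i+1.
    drops = [a - b for a, b in zip(proportions, proportions[1:])]
    return proportions, cumulative, drops


proportions, cumulative, drops = variance_profile(io.StringIO(SAMPLE_VARIANCE))
for rank, drop in enumerate(drops, start=2):
    print("PC %d explains %.6f less than PC %d" % (rank, drop, rank - 1))
```

Shrinking drops from one component to the next are the numerical counterpart of the curve flattening out.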
We also compute the normalized dispersion of each gene, after binning genes by their mean expression across the dataset. This provides a useful measure of variability of each gene.
$ head -5 analysis/pca/10_components/dispersion.csv
Gene,Normalized.Dispersion
ENSG00000228327,2.0138970131886671
ENSG00000237491,1.3773662040549017
ENSG00000177757,-0.28102027567224191
ENSG00000225880,1.9887312950109921
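A common follow-up is to rank genes by normalized dispersion to find the most variable ones. The sketch below does this with the rows shown above; the helper most_variable_genes is hypothetical, and how many genes to keep is left to the caller.

```python
import csv
import io

# Rows copied from the dispersion.csv excerpt above.
SAMPLE_DISPERSION = (
    "Gene,Normalized.Dispersion\n"
    "ENSG00000228327,2.0138970131886671\n"
    "ENSG00000237491,1.3773662040549017\n"
    "ENSG00000177757,-0.28102027567224191\n"
    "ENSG00000225880,1.9887312950109921\n"
)


def most_variable_genes(handle, n):
    """Hypothetical helper: the n genes with the highest normalized dispersion."""
    rows = [(r["Gene"], float(r["Normalized.Dispersion"]))
            for r in csv.DictReader(handle)]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


top_variable = most_variable_genes(io.StringIO(SAMPLE_DISPERSION), n=2)
for gene, dispersion in top_variable:
    print(gene, dispersion)
```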
After running PCA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize cells in a 2-D space.
$ head -5 analysis/tsne/2_components/projection.csv
Barcode,TSNE-1,TSNE-2
AAACATACAACGAA-1,-13.5494,1.4674
AAACATACTACGCA-1,-2.7325,-10.6347
AAACCGTGTCTCGC-1,12.9590,-1.6369
AAACGCACAACCAC-1,-9.3585,-6.7300
K-means clustering is then run to group together cells that have similar expression profiles, based on their projection into PCA space. K-means is run for each value of K=2,...,N, where K corresponds to the number of clusters. By default N=10; when running reanalyze, you can choose to increase it. The results for each K are separated into their own directory.
$ ls analysis/kmeans
10_clusters  3_clusters  5_clusters  7_clusters  9_clusters
2_clusters   4_clusters  6_clusters  8_clusters
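Because the directory names sort lexicographically, 10_clusters is listed before 2_clusters. When iterating over the results programmatically, it may be convenient to recover the values of K in numeric order. The sketch below is one way to do so; the helper cluster_counts is hypothetical, and the list of names stands in for what os.listdir("analysis/kmeans") would return.

```python
import re

# Directory names from the listing above, in lexicographic order.
# For a real run, os.listdir("analysis/kmeans") would supply them.
LISTING = [
    "10_clusters", "2_clusters", "3_clusters", "4_clusters", "5_clusters",
    "6_clusters", "7_clusters", "8_clusters", "9_clusters",
]


def cluster_counts(names):
    """Hypothetical helper: extract K from '<K>_clusters' names, sorted numerically."""
    ks = []
    for name in names:
        match = re.fullmatch(r"(\d+)_clusters", name)
        if match:  # ignore anything that does not follow the naming pattern
            ks.append(int(match.group(1)))
    return sorted(ks)


ks = cluster_counts(LISTING)
print(ks)
```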
For each K, cellranger produces cluster assignments for each cell.
$ head -5 analysis/kmeans/3_clusters/clusters.csv
Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
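Since the cluster assignments and the t-SNE projection share the Barcode column, they can be joined to colour a t-SNE plot by cluster or to count cells per cluster. The following sketch shows one way to do this with the rows from the two excerpts; the helper names read_clusters and read_tsne are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# Rows copied from the clusters.csv and t-SNE projection.csv excerpts.
SAMPLE_CLUSTERS = (
    "Barcode,Cluster\n"
    "AAACATACAACGAA-1,2\n"
    "AAACATACTACGCA-1,2\n"
    "AAACCGTGTCTCGC-1,1\n"
    "AAACGCACAACCAC-1,3\n"
)
SAMPLE_TSNE = (
    "Barcode,TSNE-1,TSNE-2\n"
    "AAACATACAACGAA-1,-13.5494,1.4674\n"
    "AAACATACTACGCA-1,-2.7325,-10.6347\n"
    "AAACCGTGTCTCGC-1,12.9590,-1.6369\n"
    "AAACGCACAACCAC-1,-9.3585,-6.7300\n"
)


def read_clusters(handle):
    """Hypothetical helper: {barcode: cluster number}."""
    return {r["Barcode"]: int(r["Cluster"]) for r in csv.DictReader(handle)}


def read_tsne(handle):
    """Hypothetical helper: {barcode: (TSNE-1, TSNE-2)}."""
    return {r["Barcode"]: (float(r["TSNE-1"]), float(r["TSNE-2"]))
            for r in csv.DictReader(handle)}


clusters = read_clusters(io.StringIO(SAMPLE_CLUSTERS))
tsne = read_tsne(io.StringIO(SAMPLE_TSNE))

# Number of cells assigned to each cluster.
sizes = Counter(clusters.values())

# t-SNE coordinates grouped by cluster, joined on barcode.
points_by_cluster = defaultdict(list)
for barcode, cluster in clusters.items():
    if barcode in tsne:
        points_by_cluster[cluster].append(tsne[barcode])

print(dict(sizes))
```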
cellranger also produces a table indicating which genes are differentially expressed in each cluster relative to the other clusters. For each gene we compute three values per cluster: the mean UMI counts, the log2 fold change, and the adjusted p-value.
The table is located in a different directory from the kmeans results, but follows the same structure, with each value of K separated into its own directory.
$ head -5 analysis/diffexp/3_clusters/differential_expression.csv
Gene ID,Gene Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,0.00071455453829430329,-2.3725403666493312,0.0043023680184636837
ENSG00000237491,RP11-206L10.9,0.00012635330969630726,-0.31783275717885928,0.40959138980118809,0.0,3.8319652342760779,0.11986963938734894,0.0,0.56605908868652577,0.39910771338768203
ENSG00000177757,FAM87B,0.0,-2.9027952579000154,0.0,0.0,3.2470027335549219,0.19129034227967889,0.00071455453829430329,3.1510215894076818,0.0
ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,0.00059189960914085744
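As a sketch of how this table might be filtered, the example below selects genes that appear up-regulated in a given cluster, using the column names from the header above. The thresholds (adjusted p-value at most 0.05, log2 fold change at least 1.0) and the helper name upregulated_genes are assumptions chosen for illustration, not cellranger defaults. The sample values are the rows above, rounded for brevity.

```python
import csv
import io

# Rows from the differential_expression.csv excerpt, with values rounded
# for brevity; the column names are unchanged.
SAMPLE_DIFFEXP = (
    "Gene ID,Gene Name,"
    "Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,"
    "Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,"
    "Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value\n"
    "ENSG00000228327,RP11-206L10.2,0.0057,2.6208,0.00052,0.0,-0.7530,0.6407,0.0007,-2.3725,0.0043\n"
    "ENSG00000237491,RP11-206L10.9,0.0001,-0.3178,0.4096,0.0,3.8320,0.1199,0.0,0.5661,0.3991\n"
    "ENSG00000177757,FAM87B,0.0,-2.9028,0.0,0.0,3.2470,0.1913,0.0007,3.1510,0.0\n"
    "ENSG00000225880,LINC00115,0.0004,-5.7102,8.4752e-28,0.2079,7.9658,1.3375e-46,0.0018,-2.2065,0.00059\n"
)


def upregulated_genes(rows, cluster, max_p=0.05, min_log2fc=1.0):
    """Hypothetical helper: names of genes passing both (assumed) thresholds."""
    fc_col = "Cluster %d Log2 fold change" % cluster
    p_col = "Cluster %d Adjusted p value" % cluster
    return [r["Gene Name"] for r in rows
            if float(r[fc_col]) >= min_log2fc and float(r[p_col]) <= max_p]


rows = list(csv.DictReader(io.StringIO(SAMPLE_DIFFEXP)))
for k in (1, 2, 3):
    print(k, upregulated_genes(rows, k))
```

Requiring both a fold-change and a p-value cutoff avoids reporting genes with a large but statistically unsupported fold change, such as RP11-206L10.9 in cluster 2.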
If you analyzed a multi-species experiment, the analysis output will look different. For example, the human-mouse mixing experiment is run to verify system functionality. It consists of mixing approximately 600 human (HEK293T) cells and 600 mouse (3T3) cells in a 1:1 ratio.
cellranger produces a single analysis CSV file indicating whether each GEM contains only a single human cell (hg19), a single mouse cell (mm10) or multiple mouse and human cells (Multiplet).
$ cd /home/jdoe/runs/sample345/outs
$ head -5 analysis/gem_classification.csv
barcode,hg19,mm10,call
AAACATACACCTCC-1,3,815,mm10
AAACATACACCTGA-1,14,780,mm10
AAACATACACGTGT-1,2,439,mm10
AAACATACAGACTC-1,700,776,Multiplet
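A quick way to summarize this file is to tally the calls and compute, for each barcode, the fraction of counts assigned to mouse. The sketch below does this for the rows shown above. It assumes the hg19 and mm10 columns hold per-species counts for each barcode, and the helper name summarize_gems is hypothetical.

```python
import csv
import io
from collections import Counter

# Rows copied from the gem_classification.csv excerpt above.
SAMPLE_GEMS = (
    "barcode,hg19,mm10,call\n"
    "AAACATACACCTCC-1,3,815,mm10\n"
    "AAACATACACCTGA-1,14,780,mm10\n"
    "AAACATACACGTGT-1,2,439,mm10\n"
    "AAACATACAGACTC-1,700,776,Multiplet\n"
)


def summarize_gems(handle):
    """Hypothetical helper: (call tally, {barcode: mouse fraction of counts})."""
    calls = Counter()
    mouse_fraction = {}
    for row in csv.DictReader(handle):
        calls[row["call"]] += 1
        human, mouse = int(row["hg19"]), int(row["mm10"])
        total = human + mouse
        # Guard against barcodes with no counts in either species.
        mouse_fraction[row["barcode"]] = mouse / total if total else float("nan")
    return calls, mouse_fraction


calls, mouse_fraction = summarize_gems(io.StringIO(SAMPLE_GEMS))
print(dict(calls))
```

In these four rows, the barcode called Multiplet has a mouse fraction near one half, while the mm10 calls are close to one.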