
# Run Analysis

The count, aggr and reanalyze pipelines output several CSV files that contain automated secondary analysis results. A subset of these results is used to render the Analysis View in the run summary.

## Dimensionality Reduction

Before clustering the cells, Principal Component Analysis (PCA) is run on the normalized filtered gene-barcode matrix to reduce the number of feature (gene) dimensions. This produces a projection of each cell onto the first N principal components. By default N=10; when running reanalyze, you can choose to increase it.

$ cd /home/jdoe/runs/sample345/outs
$ head -2 analysis/pca/10_components/projection.csv
Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10
AAACATACAACGAA-1,-0.2765,-5.7056,6.5324,-12.2736,-1.4390,-1.1656,-0.1754,-2.9748,3.3785,1.6539


This also produces a components matrix which indicates how much each gene contributed to each principal component.

$ head -2 analysis/pca/10_components/components.csv
PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,...,-0.0104

This also produces the proportion of total variance explained by each principal component. When choosing the number of principal components that are significant, it is useful to look at the plot of variance explained as a function of PC rank - when the numbers start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.

$ head -5 analysis/pca/10_components/variance.csv
PC,Proportion.Variance.Explained
1,0.0056404970744118104
2,0.0038897311237809061
3,0.0028803714818085419
4,0.0020830581822081206
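
The flattening of variance explained can also be checked numerically. The following is a minimal sketch, not part of cellranger: it reads the variance table with Python's standard `csv` module and reports how much less variance each PC explains than the one before it. The four rows are the ones shown above, embedded as a string so the example is self-contained; the function names `read_variance` and `successive_drops` are illustrative.

```python
import csv
import io

# Rows copied from the variance.csv excerpt above.
# In practice, open analysis/pca/10_components/variance.csv instead.
VARIANCE_CSV = """PC,Proportion.Variance.Explained
1,0.0056404970744118104
2,0.0038897311237809061
3,0.0028803714818085419
4,0.0020830581822081206
"""


def read_variance(handle):
    """Return a list of (pc_rank, proportion_of_variance) tuples."""
    reader = csv.DictReader(handle)
    return [
        (int(row["PC"]), float(row["Proportion.Variance.Explained"]))
        for row in reader
    ]


def successive_drops(variance):
    """Return (pc_rank, drop) pairs: how much less variance a PC
    explains than the PC ranked immediately before it."""
    return [
        (variance[i + 1][0], variance[i][1] - variance[i + 1][1])
        for i in range(len(variance) - 1)
    ]


variance = read_variance(io.StringIO(VARIANCE_CSV))
drops = successive_drops(variance)
for pc, drop in drops:
    print(f"PC {pc} explains {drop:.6f} less variance than PC {pc - 1}")
```

Shrinking drops from one PC to the next correspond to the curve flattening out; where to cut off remains a judgment call.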


We also compute the normalized dispersion of each gene, after binning genes by their mean expression across the dataset. This provides a useful measure of variability of each gene.

$ head -5 analysis/pca/10_components/dispersion.csv
Gene,Normalized.Dispersion
ENSG00000228327,2.0138970131886671
ENSG00000237491,1.3773662040549017
ENSG00000177757,-0.28102027567224191
ENSG00000225880,1.9887312950109921

## Visualization

After running PCA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize cells in a 2-D space.

$ head -5 analysis/tsne/2_components/projection.csv
Barcode,TSNE-1,TSNE-2
AAACATACAACGAA-1,-13.5494,1.4674
AAACATACTACGCA-1,-2.7325,-10.6347
AAACCGTGTCTCGC-1,12.9590,-1.6369
AAACGCACAACCAC-1,-9.3585,-6.7300


## Clustering

Clustering is then run to group together cells that have similar expression profiles, based on their projection into PCA space. Graph-based clustering (under graphclust) is run once, as it does not require prespecification of the number of clusters. K-means (under kmeans) is run for each K=2,...,N, where K is the number of clusters. By default N=10; when running reanalyze, you can choose to increase it. The results for each K are separated into their own directory.
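
Because each K gets its own directory, scripts that walk the results usually need to recover K from the directory name. The sketch below assumes the directories are named `kmeans_<K>_clusters`, as they are in the `analysis/clustering` listing on this page; the helper names `kmeans_dir_names` and `k_from_dir_name` are illustrative and not part of cellranger.

```python
import re


def kmeans_dir_names(max_k=10):
    """Directory names expected for K = 2..max_k (naming pattern is an assumption)."""
    return [f"kmeans_{k}_clusters" for k in range(2, max_k + 1)]


def k_from_dir_name(name):
    """Return K for a k-means directory name, or None for anything else
    (for example the graphclust directory)."""
    match = re.fullmatch(r"kmeans_(\d+)_clusters", name)
    return int(match.group(1)) if match else None


names = ["graphclust"] + kmeans_dir_names(10)

# Keep only the k-means directories and order them by K rather than by name.
kmeans_only = [n for n in names if k_from_dir_name(n) is not None]
by_k = sorted(kmeans_only, key=k_from_dir_name)
print(by_k)
```

Sorting by the parsed K matters because a plain alphabetical sort places `kmeans_10_clusters` before `kmeans_2_clusters`.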

$ ls analysis/clustering
graphclust  kmeans_10_clusters  kmeans_2_clusters  kmeans_3_clusters  kmeans_4_clusters  kmeans_5_clusters  kmeans_6_clusters  kmeans_7_clusters  kmeans_8_clusters  kmeans_9_clusters

For each clustering, cellranger produces cluster assignments for each cell.

$ head -5 analysis/clustering/kmeans_3_clusters/clusters.csv
Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
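
The cluster assignments and the t-SNE projection share the Barcode column, so the two tables can be joined to colour or group cells by cluster. The following is a minimal sketch using only the Python standard library and the rows shown in the excerpts above, embedded as strings so it is self-contained. The function names are illustrative, and skipping barcodes that have no cluster assignment is an assumption of this sketch.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the t-SNE projection.csv and kmeans_3_clusters/clusters.csv
# excerpts above. In practice, open the files themselves.
TSNE_CSV = """Barcode,TSNE-1,TSNE-2
AAACATACAACGAA-1,-13.5494,1.4674
AAACATACTACGCA-1,-2.7325,-10.6347
AAACCGTGTCTCGC-1,12.9590,-1.6369
AAACGCACAACCAC-1,-9.3585,-6.7300
"""

CLUSTERS_CSV = """Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
"""


def read_clusters(handle):
    """Map each barcode to its integer cluster label."""
    return {row["Barcode"]: int(row["Cluster"]) for row in csv.DictReader(handle)}


def group_tsne_by_cluster(tsne_handle, clusters):
    """Return {cluster: [(tsne1, tsne2), ...]} for barcodes present in both tables."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsne_handle):
        cluster = clusters.get(row["Barcode"])
        if cluster is None:
            continue  # barcode has no cluster assignment in this table
        groups[cluster].append((float(row["TSNE-1"]), float(row["TSNE-2"])))
    return dict(groups)


clusters = read_clusters(io.StringIO(CLUSTERS_CSV))
points = group_tsne_by_cluster(io.StringIO(TSNE_CSV), clusters)
for cluster in sorted(points):
    print(cluster, len(points[cluster]), points[cluster])
```

The grouped coordinates can then be passed to any plotting library, one series per cluster.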


## Differential Expression

cellranger also produces a table indicating which genes are differentially expressed in each cluster relative to all other clusters. For each gene we compute three values per cluster:

• The mean UMI counts per cell of this gene in cluster i
• The log2 fold-change of this gene's expression in cluster i relative to other clusters
• The p-value denoting significance of this gene's expression in cluster i relative to other clusters, adjusted to account for the number of hypotheses (i.e. genes) being tested.

This table is located in a different directory from the clustering results, but follows the same structure, with each clustering separated into its own directory.
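
The three per-cluster values are typically combined to shortlist marker genes. The sketch below shows one way this might be done for a single cluster. It makes several assumptions: the dictionary keys (`gene`, `mean_umi`, `log2_fold_change`, `adjusted_p`) are hypothetical and are not the column names cellranger writes, the gene names and numbers are made-up toy values, and the thresholds are illustrative defaults rather than recommendations.

```python
# Toy rows with made-up values for ONE cluster. The keys are hypothetical
# stand-ins for the three values described above, not cellranger's columns.
ROWS = [
    {"gene": "GENE_A", "mean_umi": 4.20, "log2_fold_change": 2.5, "adjusted_p": 0.001},
    {"gene": "GENE_B", "mean_umi": 0.10, "log2_fold_change": 3.0, "adjusted_p": 0.400},
    {"gene": "GENE_C", "mean_umi": 1.70, "log2_fold_change": -1.8, "adjusted_p": 0.020},
    {"gene": "GENE_D", "mean_umi": 2.90, "log2_fold_change": 0.3, "adjusted_p": 0.010},
]


def up_in_cluster(rows, min_log2_fc=1.0, max_adjusted_p=0.05):
    """Return rows that pass both thresholds, largest fold-change first.

    Both thresholds are illustrative assumptions, not cellranger defaults.
    """
    hits = [
        row
        for row in rows
        if row["log2_fold_change"] >= min_log2_fc
        and row["adjusted_p"] <= max_adjusted_p
    ]
    return sorted(hits, key=lambda row: row["log2_fold_change"], reverse=True)


hits = up_in_cluster(ROWS)
print([row["gene"] for row in hits])
```

Filtering on the adjusted p-value as well as the fold-change excludes genes such as the toy `GENE_B`, whose large fold-change is not statistically supported.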