Chromium Single Cell ATAC

Cell Ranger ATAC 1.0, printed on 06/20/2021

The `count` pipeline outputs several CSV files that contain automated secondary analysis results. A subset of these results is used to render the Cell Clustering view in the run summary.

Before clustering the cells, Latent Semantic Analysis (LSA) is run on the normalized filtered peak-barcode matrix to reduce the number of feature (peak) dimensions. This produces a projection of each cell onto the first N components (default N=15). One may alternatively choose Principal Component Analysis (PCA) or Probabilistic Latent Semantic Analysis (PLSA) to perform dimensionality reduction in the pipeline. All of these methods provide the following basic CSV output files.

    $ cd /home/jdoe/runs/sample345/outs
    $ head -2 analysis/lsa/15_components/projection.csv
    Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10,PC-11,PC-12,PC-13,PC-14,PC-15
    AAATGAGCAATCAGGG-1,-2.0256188855237585,-20.464971914743963,2.4066208658862194,-0.9789882112497361,-0.09345960806751374,-8.483300343102174,-5.672765504454421,18.312955842984056,5.6927438340737195,-3.0378744705134992,0.3959335790734238,-4.93326991505897,9.485264727952154,0.2107363858043646,0.948135821430962
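As a minimal sketch of how this file might be read downstream, the snippet below parses a projection CSV into a mapping from barcode to its component coordinates. The function name `load_projection` is hypothetical and not part of Cell Ranger ATAC; the sample text is the header and the single row shown in the excerpt above.

```python
import csv
import io

# Header and first row copied from the projection.csv excerpt above.
SAMPLE = (
    "Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10,"
    "PC-11,PC-12,PC-13,PC-14,PC-15\n"
    "AAATGAGCAATCAGGG-1,-2.0256188855237585,-20.464971914743963,"
    "2.4066208658862194,-0.9789882112497361,-0.09345960806751374,"
    "-8.483300343102174,-5.672765504454421,18.312955842984056,"
    "5.6927438340737195,-3.0378744705134992,0.3959335790734238,"
    "-4.93326991505897,9.485264727952154,0.2107363858043646,"
    "0.948135821430962\n"
)


def load_projection(handle):
    """Return (component_names, {barcode: [coordinates]}) from a projection CSV.

    Hypothetical helper: it only assumes a header row whose first column is
    the barcode and whose remaining columns are numeric components.
    """
    reader = csv.reader(handle)
    header = next(reader)
    components = header[1:]
    cells = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        coords = [float(value) for value in row[1:]]
        if len(coords) != len(components):
            raise ValueError(f"unexpected column count for barcode {row[0]}")
        cells[row[0]] = coords
    return components, cells


components, cells = load_projection(io.StringIO(SAMPLE))
print(len(components), "components for", len(cells), "cell(s)")
```

To read a real run, the `io.StringIO(SAMPLE)` argument could be replaced with a handle opened on `analysis/lsa/15_components/projection.csv`, for example with `open(path, newline="")`.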

This also produces a components matrix which indicates how much each peak contributed to each component.

    $ head -2 analysis/lsa/15_components/components.csv
    PC,chr1:9695143-9697582,chr1:9698212-9701041,...
    1,-0.5482991923678618,-0.6374211593177428,...

The pipeline also reports the proportion of total variance explained by each component. When choosing how many components are significant, it is useful to plot the variance explained as a function of component rank: once the values start to flatten out, subsequent components are unlikely to represent meaningful variation in the data.

    $ head -5 analysis/lsa/15_components/variance.csv
    PC,Proportion.Variance.Explained
    1,0.8452609977911579
    2,0.032765042936590785
    3,0.026127180307558735
    4,0.01667142686188944
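The "flattening out" check can be approximated numerically. The sketch below computes the cumulative proportion of variance and reports the first component whose own contribution falls below a cutoff. The helper names and the cutoff value `0.02` are illustrative assumptions, not pipeline defaults; the sample rows are the ones shown above.

```python
import csv
import io

# Rows copied from the variance.csv excerpt above.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.8452609977911579
2,0.032765042936590785
3,0.026127180307558735
4,0.01667142686188944
"""


def load_variance(handle):
    """Return [(component_rank, proportion_explained), ...] in file order."""
    reader = csv.DictReader(handle)
    return [
        (int(row["PC"]), float(row["Proportion.Variance.Explained"]))
        for row in reader
    ]


def cumulative(rows):
    """Return [(component_rank, running_total_of_proportions), ...]."""
    total = 0.0
    out = []
    for rank, proportion in rows:
        total += proportion
        out.append((rank, total))
    return out


def first_flat_component(rows, min_gain=0.02):
    """Return the first rank explaining less than min_gain, or None.

    min_gain is an arbitrary illustrative cutoff, not a Cell Ranger ATAC
    parameter; inspecting the plot remains the more reliable check.
    """
    for rank, proportion in rows:
        if proportion < min_gain:
            return rank
    return None


rows = load_variance(io.StringIO(SAMPLE))
for rank, total in cumulative(rows):
    print(f"PC {rank}: cumulative proportion {total:.4f}")
print("first component below cutoff:", first_flat_component(rows))
```

A fixed cutoff is only a rough stand-in for visual inspection; a different dataset may call for a different value.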

We also compute the normalized dispersion of each peak, after binning peaks by their mean expression across the dataset. This provides a useful measure of variability of each peak.

    $ head -5 analysis/lsa/15_components/dispersion.csv
    Feature,Normalized.Dispersion
    chr1:9695143-9697582,0.02029960904777695
    chr1:9698212-9701041,0.10379770925583033
    chr1:9825253-9827762,-1.0
    chr1:9829746-9830116,25.528012093307737
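One way to use this file is to rank peaks by normalized dispersion and split each peak name into genomic coordinates. The sketch below does both on the rows shown above. The helper names `parse_peak` and `top_dispersed` are hypothetical, and the code makes no assumption about what a particular dispersion value (such as `-1.0`) signifies; it simply sorts the numbers.

```python
import csv
import io

# Rows copied from the dispersion.csv excerpt above.
SAMPLE = """Feature,Normalized.Dispersion
chr1:9695143-9697582,0.02029960904777695
chr1:9698212-9701041,0.10379770925583033
chr1:9825253-9827762,-1.0
chr1:9829746-9830116,25.528012093307737
"""


def parse_peak(name):
    """Split a 'chrom:start-end' peak name into (chrom, start, end)."""
    # rpartition keeps any ':' inside the contig name on the chromosome side.
    chrom, _, span = name.rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def top_dispersed(handle, n):
    """Return the n (peak_name, dispersion) pairs with the highest dispersion."""
    reader = csv.DictReader(handle)
    rows = [
        (row["Feature"], float(row["Normalized.Dispersion"])) for row in reader
    ]
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


top = top_dispersed(io.StringIO(SAMPLE), 2)
for name, value in top:
    chrom, start, end = parse_peak(name)
    print(f"{chrom}\t{start}\t{end}\t{value:.3f}")
```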

After dimensionality reduction, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize cells in a 2-D space.

    $ head -5 analysis/tsne/2_components/projection.csv
    Barcode,TSNE-1,TSNE-2
    AAATGAGCAATCAGGG-1,1.552159628302055,4.434829693735686
    AACAAAGCACCTATTT-1,3.2188609791527667,0.03569781940043248
    AACCTTTCAATGATGA-1,-3.8319704788291475,-1.092848944953291
    AACTTGGCATGGCCGT-1,-4.226692514189564,0.3351938808086092

Clustering is then run to group together cells that have similar accessibility profiles, based on their projection into the lower-dimensional space.
Graph-based clustering (under `graphclust`) is run once, as it does not require prespecification of the number of clusters. For PCA, K-means (under `kmeans`) is run for each value of K=2,...,N, where K is the number of clusters. For LSA or PLSA, K-medoids (under `kmedoids`) is run over the same range of K. By default N=10. The results for each K are written to their own directory.

    $ ls analysis/clustering
    graphclust            kmedoids_2_clusters  kmedoids_4_clusters  kmedoids_6_clusters  kmedoids_8_clusters
    kmedoids_10_clusters  kmedoids_3_clusters  kmedoids_5_clusters  kmedoids_7_clusters  kmedoids_9_clusters
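The directory naming in the listing above follows a regular pattern, which a script can generate or parse. The sketch below is an illustration only: `expected_clustering_dirs` and `parse_clustering_dir` are hypothetical helpers, and the pattern `<method>_<K>_clusters` is inferred from the listing rather than from a documented specification.

```python
import re

# Pattern inferred from the directory listing above (an assumption).
DIR_PATTERN = re.compile(r"^(kmeans|kmedoids)_(\d+)_clusters$")


def expected_clustering_dirs(method="kmedoids", max_k=10):
    """List the directory names expected for graphclust plus K=2..max_k."""
    return ["graphclust"] + [f"{method}_{k}_clusters" for k in range(2, max_k + 1)]


def parse_clustering_dir(name):
    """Return (method, K) for a clustering directory; K is None for graphclust."""
    if name == "graphclust":
        return "graphclust", None
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized clustering directory: {name}")
    return match.group(1), int(match.group(2))


names = expected_clustering_dirs()
print(names)
print(parse_clustering_dir("kmedoids_3_clusters"))
```

With the defaults this yields ten names, matching the ten entries in the listing; passing `method="kmeans"` would give the names expected after a PCA run.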

For each clustering, `cellranger-atac` produces cluster assignments for each cell.

    $ head -5 analysis/clustering/kmedoids_3_clusters/clusters.csv
    Barcode,Cluster
    AAATGAGCAATCAGGG-1,2
    AACAAAGCACCTATTT-1,1
    AACCTTTCAATGATGA-1,3
    AACTTGGCATGGCCGT-1,3
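Because every output file is keyed by barcode, cluster assignments can be joined to the t-SNE coordinates. The sketch below counts cells per cluster and computes each cluster's centroid in t-SNE space, using the four barcodes that appear in both excerpts on this page. All function names are hypothetical, and pairing this particular clustering with the t-SNE excerpt is done purely for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Rows copied from the clusters.csv and t-SNE projection.csv excerpts.
CLUSTERS = """Barcode,Cluster
AAATGAGCAATCAGGG-1,2
AACAAAGCACCTATTT-1,1
AACCTTTCAATGATGA-1,3
AACTTGGCATGGCCGT-1,3
"""

TSNE = """Barcode,TSNE-1,TSNE-2
AAATGAGCAATCAGGG-1,1.552159628302055,4.434829693735686
AACAAAGCACCTATTT-1,3.2188609791527667,0.03569781940043248
AACCTTTCAATGATGA-1,-3.8319704788291475,-1.092848944953291
AACTTGGCATGGCCGT-1,-4.226692514189564,0.3351938808086092
"""


def load_clusters(handle):
    """Return {barcode: cluster_label} from a clusters CSV."""
    return {row["Barcode"]: int(row["Cluster"]) for row in csv.DictReader(handle)}


def load_tsne(handle):
    """Return {barcode: (tsne_1, tsne_2)} from a t-SNE projection CSV."""
    return {
        row["Barcode"]: (float(row["TSNE-1"]), float(row["TSNE-2"]))
        for row in csv.DictReader(handle)
    }


def cluster_centroids(clusters, coords):
    """Average the t-SNE coordinates of the cells assigned to each cluster."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for barcode, label in clusters.items():
        if barcode not in coords:  # ignore barcodes missing from the projection
            continue
        x, y = coords[barcode]
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


clusters = load_clusters(io.StringIO(CLUSTERS))
coords = load_tsne(io.StringIO(TSNE))
sizes = Counter(clusters.values())
centroids = cluster_centroids(clusters, coords)
for label in sorted(sizes):
    print(label, sizes[label], centroids[label])
```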

Prior to differential analysis, `cellranger-atac` produces a transcription factor-barcode matrix of counts as described in Matrices.
`cellranger-atac` also produces a table indicating which transcription factor motifs are differentially active in each cluster relative to all other clusters.
For each transcription factor motif we compute three values per cluster:

- The mean cut site counts per cell pooled in peaks associated with this transcription factor motif in cluster *i*.
- The log2 fold-change of this transcription factor motif's activity in cluster *i* relative to other clusters.
- The p-value denoting significance of this transcription factor motif's activity in cluster *i* relative to other clusters, adjusted to account for the number of hypotheses (i.e. transcription factor motifs) being tested.

This table is located in a different directory from the clustering results, but follows the same structure, with each clustering separated into its own directory.

    $ head -5 analysis/diffexp/kmeans_3_clusters/differential_expression.csv
    Gene ID,Gene Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
    ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,0.00071455453829430329,-2.3725403666493312,0.0043023680184636837
    ENSG00000237491,RP11-206L10.9,0.00012635330969630726,-0.31783275717885928,0.40959138980118809,0.0,3.8319652342760779,0.11986963938734894,0.0,0.56605908868652577,0.39910771338768203
    ENSG00000177757,FAM87B,0.0,-2.9027952579000154,0.0,0.0,3.2470027335549219,0.19129034227967889,0.00071455453829430329,3.1510215894076818,0.0
    ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,0.00059189960914085744
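The table above is wide, with three columns per cluster. A minimal sketch for reshaping it into one record per feature and cluster, then keeping features that are enriched in a cluster, is given below. It reads the column labels exactly as they appear in the header of the excerpt, so it does not depend on what the first two columns are called. The helper names and the `0.05` adjusted p-value cutoff are illustrative assumptions, not pipeline settings; the sample uses the header and two of the rows shown above.

```python
import csv
import io
import re

# Header and two rows copied from the differential_expression.csv excerpt.
SAMPLE = (
    "Gene ID,Gene Name,"
    "Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,"
    "Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,"
    "Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value\n"
    "ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,"
    "0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,"
    "0.00071455453829430329,-2.3725403666493312,0.0043023680184636837\n"
    "ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,"
    "8.4751637615375386e-28,0.20790015775229512,7.965820981010868,"
    "1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,"
    "0.00059189960914085744\n"
)

# Matches per-cluster column labels such as "Cluster 2 Log2 fold change".
CLUSTER_COLUMN = re.compile(r"^Cluster (\d+) (.+)$")


def load_records(handle):
    """Reshape the wide table into one dict per (feature, cluster) pair."""
    reader = csv.reader(handle)
    header = next(reader)
    records = []
    for row in reader:
        if not row:
            continue
        per_cluster = {}
        for label, value in zip(header[2:], row[2:]):
            match = CLUSTER_COLUMN.match(label)
            if match is None:  # ignore columns that are not per-cluster values
                continue
            stats = per_cluster.setdefault(int(match.group(1)), {})
            stats[match.group(2)] = float(value)
        for cluster, stats in sorted(per_cluster.items()):
            records.append({"id": row[0], "name": row[1], "cluster": cluster, **stats})
    return records


def enriched(records, max_adjusted_p=0.05):
    """Keep (name, cluster) pairs with positive fold change and small adjusted p.

    The 0.05 cutoff is a conventional illustrative choice, not a pipeline default.
    """
    return [
        (rec["name"], rec["cluster"])
        for rec in records
        if rec["Log2 fold change"] > 0 and rec["Adjusted p value"] < max_adjusted_p
    ]


records = load_records(io.StringIO(SAMPLE))
print(enriched(records))
```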
