Chromium Single Cell ATAC

Cell Ranger ATAC 2.1, printed on 10/05/2024

The `cellranger-atac count` pipeline outputs several CSV files which contain
automated secondary analysis results. A subset of these results is used to
render the Cell Clustering view in the run summary.

Before clustering the cells, Latent Semantic Analysis (LSA) is run on the normalized filtered peak-barcode matrix to reduce the number of feature (peak) dimensions. This produces a projection of each cell onto the first N components (default N=15). Principal Component Analysis (PCA) or Probabilistic Latent Semantic Analysis (PLSA) can be chosen instead to perform dimensionality reduction in the pipeline. All of these methods produce the following basic CSV output files. Note that the PLSA algorithm in 2.0 is restricted to one thread for technical reasons, so the computational performance of dimensionality reduction is likely to be affected.

    $ cd /home/jdoe/runs/sample345/outs
    $ head -2 analysis/lsa/15_components/projection.csv
    Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10,PC-11,PC-12,PC-13,PC-14,PC-15
    AAATGAGCAATCAGGG-1,-2.0256188855237585,-20.464971914743963,2.4066208658862194,-0.9789882112497361,-0.09345960806751374,-8.483300343102174,-5.672765504454421,18.312955842984056,5.6927438340737195,-3.0378744705134992,0.3959335790734238,-4.93326991505897,9.485264727952154,0.2107363858043646,0.948135821430962
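
As a minimal sketch of how this file might be consumed downstream, the following Python snippet parses a projection file into a barcode-to-coordinates mapping. The `load_projection` helper is a hypothetical name and not part of Cell Ranger ATAC. The inline sample keeps only the first three components of the row shown above; in practice one would pass an open handle to `projection.csv` instead.

```python
import csv
import io

# First three components of the row shown above; the real file has 15 columns.
SAMPLE = """Barcode,PC-1,PC-2,PC-3
AAATGAGCAATCAGGG-1,-2.0256188855237585,-20.464971914743963,2.4066208658862194
"""

def load_projection(handle):
    """Return {barcode: [coordinate, ...]} from a projection.csv-style file."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "Barcode":
        raise ValueError("unexpected header: %r" % header)
    # Every column after the barcode is one component coordinate.
    return {row[0]: [float(x) for x in row[1:]] for row in reader if row}

projection = load_projection(io.StringIO(SAMPLE))
print(len(projection["AAATGAGCAATCAGGG-1"]))  # number of components parsed
```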

A components matrix is produced which indicates how much each peak contributed to each component.

    $ head -2 analysis/lsa/15_components/components.csv
    PC,chr1:9695143-9697582,chr1:9698212-9701041,...
    1,-0.5482991923678618,-0.6374211593177428,...

The third file contains the peaks with the highest dispersion that were selected for use in the principal component calculations.

    $ head -5 analysis/lsa/15_components/features_selected.csv
    Feature
    1,chr21:42879060-42879960
    2,chr21:44817476-44818342
    3,chr21:46323900-46324631
    4,chr21:37072809-37073658

The fourth file records the proportion of total variance explained by each principal component. When choosing the number of components that are significant, it is useful to look at the plot of variance explained as a function of component rank: once the values start to flatten out, subsequent components are unlikely to represent meaningful variation in the data.

    $ head -5 analysis/lsa/15_components/variance.csv
    PC,Proportion.Variance.Explained
    1,0.8452609977911579
    2,0.032765042936590785
    3,0.026127180307558735
    4,0.01667142686188944
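
One way to inspect where the curve flattens, without plotting, is to accumulate the proportions by component rank. The sketch below does this for the four rows shown above; `cumulative_variance` is a hypothetical helper name, and the inline sample stands in for the real `variance.csv`.

```python
import csv
import io
from itertools import accumulate

# The rows printed by `head -5` above.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.8452609977911579
2,0.032765042936590785
3,0.026127180307558735
4,0.01667142686188944
"""

def cumulative_variance(handle):
    """Return the running total of variance explained, in component order."""
    rows = list(csv.DictReader(handle))
    proportions = [float(r["Proportion.Variance.Explained"]) for r in rows]
    return list(accumulate(proportions))

cumulative = cumulative_variance(io.StringIO(SAMPLE))
for rank, total in enumerate(cumulative, start=1):
    print(rank, round(total, 4))
```

Small increments between consecutive totals suggest that the later components add little.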

The final file computed by Cell Ranger ATAC lists the normalized dispersion of each peak, after binning peaks by their mean expression across the dataset. This provides a useful measure of variability of each peak.

    $ head -5 analysis/lsa/15_components/dispersion.csv
    Feature,Normalized.Dispersion
    chr1:9695143-9697582,0.02029960904777695
    chr1:9698212-9701041,0.10379770925583033
    chr1:9825253-9827762,-1.0
    chr1:9829746-9830116,25.528012093307737
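
To illustrate how this file could be used to rank peaks by variability, the sketch below sorts the rows above by normalized dispersion and keeps the top entries. `top_dispersed` is a hypothetical helper and is not meant to reproduce how the pipeline itself selects features.

```python
import csv
import io

# The rows printed by `head -5` above.
SAMPLE = """Feature,Normalized.Dispersion
chr1:9695143-9697582,0.02029960904777695
chr1:9698212-9701041,0.10379770925583033
chr1:9825253-9827762,-1.0
chr1:9829746-9830116,25.528012093307737
"""

def top_dispersed(handle, n):
    """Return the n peaks with the highest normalized dispersion."""
    rows = csv.DictReader(handle)
    scored = [(float(r["Normalized.Dispersion"]), r["Feature"]) for r in rows]
    scored.sort(reverse=True)  # highest dispersion first
    return [feature for _, feature in scored[:n]]

top = top_dispersed(io.StringIO(SAMPLE), 2)
print(top)
```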

After running dimensionality reduction, t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are run to visualize cells in a 2-D space.

    $ head -5 analysis/tsne/2_components/projection.csv
    Barcode,TSNE-1,TSNE-2
    AAATGAGCAATCAGGG-1,1.552159628302055,4.434829693735686
    AACAAAGCACCTATTT-1,3.2188609791527667,0.03569781940043248
    AACCTTTCAATGATGA-1,-3.8319704788291475,-1.092848944953291
    AACTTGGCATGGCCGT-1,-4.226692514189564,0.3351938808086092
    $ head -5 analysis/umap/2_components/projection.csv
    Barcode,UMAP-1,UMAP-2
    AAACTCGTCCAGTTAG-1,2.1193044,2.5859628
    AAATGAGCAATCAGGG-1,0.37983754,5.982194
    AACAAAGCACCTATTT-1,-1.2387826,5.6922545
    AACCGATCAGCAACCC-1,0.16549169,6.803529

Clustering is then run to group cells together that have similar accessibility
profiles, based on their projection into lower dimensional space. Graph-based
clustering (under `graphclust`) is run once, as it does not require
prespecification of the number of clusters. For PCA, K-means (under `kmeans`)
is run for many values of K=2,...,N, where K corresponds to the number of
clusters. For LSA or PLSA, spherical k-means (under `kmeans`) is run over the
same range of K. By default N=5. The corresponding results for each K are
separated into their own directory.

    $ ls analysis/clustering
    graphclust
    kmeans_2_clusters
    kmeans_3_clusters
    kmeans_4_clusters
    kmeans_5_clusters
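
For readers unfamiliar with the spherical k-means variant mentioned above, here is a minimal generic sketch in Python. It is not the Cell Ranger ATAC implementation: the function names, the naive initialization from the first `k` points, and the fixed iteration count are all assumptions made for illustration. The key differences from ordinary k-means are that points and centroids are scaled to unit length and that assignment uses cosine similarity rather than Euclidean distance.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(points, k, iterations=20):
    """Cluster rows by cosine similarity; returns 1-based cluster labels."""
    unit = [normalize(p) for p in points]
    # Assumption: naive initialization from the first k points.
    centroids = [unit[i] for i in range(k)]
    labels = [0] * len(unit)
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(u, centroids[c])) for u in unit]
        # Recompute each centroid as the normalized mean of its members.
        for c in range(k):
            members = [u for u, l in zip(unit, labels) if l == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return [l + 1 for l in labels]

# Toy 2-D projections (illustrative values, not pipeline output).
toy = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.1], [0.1, 3.0]]
labels = spherical_kmeans(toy, k=2)
print(labels)
```

Because only direction matters, the points `[1.0, 0.0]` and `[2.0, 0.1]` end up together even though their magnitudes differ.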

For each clustering, `cellranger-atac` produces cluster
assignments for each cell.

    $ head -5 analysis/clustering/kmeans_3_clusters/clusters.csv
    Barcode,Cluster
    AAATGAGCAATCAGGG-1,2
    AACAAAGCACCTATTT-1,1
    AACCTTTCAATGATGA-1,3
    AACTTGGCATGGCCGT-1,3
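
Since cluster assignments and 2-D projections are both keyed by barcode, they can be joined for plotting or summary statistics. The sketch below joins the `clusters.csv` rows above with the t-SNE rows shown earlier and counts cells per cluster. The helper names are hypothetical, and the inline samples stand in for the real files.

```python
import csv
import io
from collections import Counter

# Rows copied from the clusters.csv and t-SNE projection.csv excerpts.
CLUSTERS = """Barcode,Cluster
AAATGAGCAATCAGGG-1,2
AACAAAGCACCTATTT-1,1
AACCTTTCAATGATGA-1,3
AACTTGGCATGGCCGT-1,3
"""

TSNE = """Barcode,TSNE-1,TSNE-2
AAATGAGCAATCAGGG-1,1.552159628302055,4.434829693735686
AACAAAGCACCTATTT-1,3.2188609791527667,0.03569781940043248
AACCTTTCAATGATGA-1,-3.8319704788291475,-1.092848944953291
AACTTGGCATGGCCGT-1,-4.226692514189564,0.3351938808086092
"""

def read_clusters(handle):
    """Return {barcode: cluster label}."""
    return {r["Barcode"]: int(r["Cluster"]) for r in csv.DictReader(handle)}

def read_tsne(handle):
    """Return {barcode: (TSNE-1, TSNE-2)}."""
    return {r["Barcode"]: (float(r["TSNE-1"]), float(r["TSNE-2"]))
            for r in csv.DictReader(handle)}

clusters = read_clusters(io.StringIO(CLUSTERS))
coords = read_tsne(io.StringIO(TSNE))

# Keep only barcodes present in both files.
labelled = {bc: (coords[bc], clusters[bc]) for bc in clusters if bc in coords}
sizes = Counter(clusters.values())
print(sorted(sizes.items()))
```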

Prior to differential analysis, `cellranger-atac` produces a
peak-barcode matrix and a transcription factor-barcode matrix of counts as
described in
Matrices.
`cellranger-atac` then produces a table indicating which
peaks and transcription factor motifs are differentially accessible in each
cluster relative to all other clusters, as per the algorithms described
here.
For each feature, whether it is a peak or a transcription factor motif, we
compute three values per cluster:

- The mean number of cut sites per cell, pooled within the feature, in cluster *i*.
- The log2 fold-change of this feature in cluster *i* relative to other clusters.
- The p-value denoting significance of this feature in cluster *i* relative to other clusters, adjusted to account for the number of hypotheses for the feature type being tested. That is, p-values for peaks are adjusted based on the number of peaks being tested, while p-values for transcription factor motifs are adjusted based on the number of transcription factor motifs.

This table is located in a different directory from the clustering results, but follows the same structure, with each clustering separated into its own directory.

    $ head -5 analysis/enrichment/kmeans_3_clusters/differential_expression.csv
    Feature ID,Feature Name,Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
    chr1:9695129-9697582,chr1:9695129-9697582,0.014098403818774368,-5.823451487250574,2.2659671842098193e-06,4.185745651762137e-09,-1.3874516676069444,0.5918812904596457,1.9512762483589925,7.238430090771634,5.00258305609651e-09
    chr1:9698210-9701041,chr1:9698210-9701041,0.013761153212430422,-6.1502095503083165,7.855686702156565e-07,0.046489553517204636,-3.0232327143356246,0.01647646310191049,2.2844378973176838,6.5025499776936115,4.703658999567952e-13
    ...
    AHR_HUMAN.H11MO.0.B,AHR_HUMAN.H11MO.0.B,1.5229979744677225e-09,-0.558490289359965,1.0,1.5229979744575502e-09,1.41990325445066,1.0,1.5229979744838465e-09,2.5360529002402097,1.0
    AIRE_HUMAN.H11MO.0.C,AIRE_HUMAN.H11MO.0.C,382.4895824324451,-1.366896997726535,0.007214824200990991,4098.191143669588,0.031632664734601475,1.0,124.22927255017468,2.136369782757689,0.0015585067057439586
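
The table is wide, with three columns per cluster, which can be awkward to filter. As a sketch, the Python below reshapes it into one record per (feature, cluster) pair and lists the features that are more accessible in a cluster. The `to_long` helper is a hypothetical name, and the 0.05 threshold is an illustrative assumption rather than a pipeline default. The sample uses the two peak rows shown above.

```python
import csv
import io
import re

# Header and the first two data rows from the excerpt above.
SAMPLE = """Feature ID,Feature Name,Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
chr1:9695129-9697582,chr1:9695129-9697582,0.014098403818774368,-5.823451487250574,2.2659671842098193e-06,4.185745651762137e-09,-1.3874516676069444,0.5918812904596457,1.9512762483589925,7.238430090771634,5.00258305609651e-09
chr1:9698210-9701041,chr1:9698210-9701041,0.013761153212430422,-6.1502095503083165,7.855686702156565e-07,0.046489553517204636,-3.0232327143356246,0.01647646310191049,2.2844378973176838,6.5025499776936115,4.703658999567952e-13
"""

# Matches per-cluster columns such as "Cluster 2 Log2 fold change".
COLUMN = re.compile(r"^Cluster (\d+) (Mean Counts|Log2 fold change|Adjusted p value)$")

def to_long(handle):
    """Yield one record per (feature, cluster) from the wide table."""
    for row in csv.DictReader(handle):
        per_cluster = {}
        for name, value in row.items():
            match = COLUMN.match(name)
            if match:
                cluster, field = int(match.group(1)), match.group(2)
                per_cluster.setdefault(cluster, {})[field] = float(value)
        for cluster, stats in sorted(per_cluster.items()):
            yield {"feature": row["Feature ID"], "cluster": cluster, **stats}

records = list(to_long(io.StringIO(SAMPLE)))

# Illustrative filter: positive fold change and adjusted p-value below 0.05.
enriched = [(r["feature"], r["cluster"]) for r in records
            if r["Log2 fold change"] > 0 and r["Adjusted p value"] < 0.05]
print(enriched)
```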

Notice that the table for any specific clustering includes differential analysis results for both peaks and transcription factor motifs.
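
Because peaks and motifs share one table but are adjusted separately, the same raw p-value can yield different adjusted values depending on the feature type. The sketch below illustrates this with a Benjamini-Hochberg adjustment applied per feature type. The choice of Benjamini-Hochberg is an assumption made here for illustration, as this page does not name the correction method. The function names and toy p-values are also hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def adjust_by_type(pvalues_by_type):
    """Adjust each feature type against its own number of hypotheses."""
    return {ftype: bh_adjust(p) for ftype, p in pvalues_by_type.items()}

# Toy raw p-values (made up for illustration, not pipeline output).
raw = {"peak": [0.01, 0.04, 0.03, 0.005], "motif": [0.01]}
adjusted = adjust_by_type(raw)
print(adjusted["peak"][0], adjusted["motif"][0])
```

In this toy input the raw p-value 0.01 is adjusted against four hypotheses for peaks but only one for motifs, so the two adjusted values differ.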

Cell Ranger ATAC does not produce the tf-barcode matrix for multi-species experiments or if the `motifs.pfm` file is missing from the reference package (for example in custom references). The pipeline cannot perform differential analysis for transcription factor motifs in these cases, so the output file will only contain analysis results on peaks.
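
A downstream script may want to check whether motif results are present before using them. The sketch below separates feature IDs into peaks and everything else. It assumes that peak IDs always follow the `contig:start-end` form seen in the examples on this page, and `split_features` is a hypothetical helper name.

```python
import re

# Assumption: peak IDs look like "contig:start-end", as in the excerpts above.
PEAK_ID = re.compile(r"^[^:\s]+:\d+-\d+$")

def split_features(feature_ids):
    """Return (peaks, others); others are treated as motif features."""
    peaks = [f for f in feature_ids if PEAK_ID.match(f)]
    others = [f for f in feature_ids if not PEAK_ID.match(f)]
    return peaks, others

# Feature IDs taken from the differential analysis excerpt.
ids = [
    "chr1:9695129-9697582",
    "chr1:9698210-9701041",
    "AHR_HUMAN.H11MO.0.B",
    "AIRE_HUMAN.H11MO.0.C",
]
peaks, motifs = split_features(ids)
has_motif_results = bool(motifs)
print(len(peaks), len(motifs), has_motif_results)
```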