Chromium Single Cell Multiome ATAC + Gene Exp.

Cell Ranger ARC 1.0, printed on 05/18/2022

The multiomic secondary analysis in Cell Ranger ARC comprises dimensionality reduction, clustering, differential analysis, feature linkage, and transcription factor motif analysis, each described below.

The output of the secondary analysis resides in the `outs/analysis` directory with the following structure:

    analysis
    ├── clustering
    │   ├── atac
    │   │   ├── graphclust
    │   │   │   ├── clusters.csv
    │   │   │   ├── differential_accessibility.csv
    │   │   │   └── differential_expression.csv
    │   │   ├── kmeans_2_clusters
    │   │   │   ├── clusters.csv
    │   │   │   ├── differential_accessibility.csv
    │   │   │   └── differential_expression.csv
    │   │   ├── kmeans_3_clusters
    │   │   │   ├── clusters.csv
    │   │   │   ├── differential_accessibility.csv
    │   │   │   └── differential_expression.csv
    │   │   ├── kmeans_4_clusters
    │   │   │   ├── clusters.csv
    │   │   │   ├── differential_accessibility.csv
    │   │   │   └── differential_expression.csv
    │   │   └── kmeans_5_clusters
    │   │       ├── clusters.csv
    │   │       ├── differential_accessibility.csv
    │   │       └── differential_expression.csv
    │   └── gex
    │       ├── graphclust
    │       │   ├── clusters.csv
    │       │   ├── differential_accessibility.csv
    │       │   └── differential_expression.csv
    │       ├── kmeans_2_clusters
    │       │   ├── clusters.csv
    │       │   ├── differential_accessibility.csv
    │       │   └── differential_expression.csv
    │       ├── kmeans_3_clusters
    │       │   ├── clusters.csv
    │       │   ├── differential_accessibility.csv
    │       │   └── differential_expression.csv
    │       ├── kmeans_4_clusters
    │       │   ├── clusters.csv
    │       │   ├── differential_accessibility.csv
    │       │   └── differential_expression.csv
    │       └── kmeans_5_clusters
    │           ├── clusters.csv
    │           ├── differential_accessibility.csv
    │           └── differential_expression.csv
    ├── dimensionality_reduction
    │   ├── atac
    │   │   ├── lsa_components.csv
    │   │   ├── lsa_dispersion.csv
    │   │   ├── lsa_features_selected.csv
    │   │   ├── lsa_projection.csv
    │   │   ├── lsa_variance.csv
    │   │   ├── tsne_projection.csv
    │   │   └── umap_projection.csv
    │   └── gex
    │       ├── pca_components.csv
    │       ├── pca_dispersion.csv
    │       ├── pca_features_selected.csv
    │       ├── pca_projection.csv
    │       ├── pca_variance.csv
    │       ├── tsne_projection.csv
    │       └── umap_projection.csv
    ├── feature_linkage
    │   ├── feature_linkage.bedpe
    │   └── feature_linkage_matrix.h5
    └── tf_analysis
        ├── filtered_tf_bc_matrix
        │   ├── barcodes.tsv.gz
        │   ├── matrix.mtx.gz
        │   └── motifs.tsv
        ├── filtered_tf_bc_matrix.h5
        └── peak_motif_mapping.bed

The primary dimensionality reduction method is Principal Component Analysis (PCA) for GEX and Latent Semantic Analysis (LSA) for ATAC.
PCA is run on the normalized filtered gene-barcode matrix to reduce the number of feature (gene) dimensions. Only gene expression features are used as PCA features.
Likewise, LSA is run on the normalized filtered peak-barcode matrix. PCA (LSA) produces the four output files described below, each with the prefix `pca_` (`lsa_`), in the directory `analysis/dimensionality_reduction/gex/` (`analysis/dimensionality_reduction/atac/`).

The first is a projection of each cell onto the first N principal components (default GEX: N=10; ATAC: N=15).

    $ cd /home/jdoe/runs/sample345/outs
    $ head -2 analysis/dimensionality_reduction/gex/pca_projection.csv
    Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10
    AAACAGCCAAGCTAAA-1,-17.688234040781857,3.7950159394896508,0.12134779569343124,8.891889169739237,1.6561792607584174,3.3562135574248586,2.1045793835246203,-5.304589200171137,-0.5285869980603226,-2.316716491709393

    $ head -2 analysis/dimensionality_reduction/atac/lsa_projection.csv
    Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10,PC-11,PC-12,PC-13,PC-14,PC-15
    AAACAGCCAAGCTAAA-1,-20.240988995299652,-9.98961195205192,-3.975713841955313,-3.6970519526233816,0.5924742121181492,0.2541630680205914,-1.8285930634181444,1.3091645487857684,-0.1932357739169616,0.09950491463448573,1.3779137917059847,-1.5110824109207137,-0.421621592950534,-0.0952461164327349,-0.20614805513560971
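The projection files are plain CSV, so they can be read with any CSV parser. The following is a minimal sketch, assuming only the layout shown above (a `Barcode` column followed by one column per component); the helper name `read_projection` is hypothetical, and the sample keeps just the first three GEX components for brevity.

```python
import csv
import io

# First data row of gex/pca_projection.csv from the listing above,
# truncated to three components for brevity.
SAMPLE = """Barcode,PC-1,PC-2,PC-3
AAACAGCCAAGCTAAA-1,-17.688234040781857,3.7950159394896508,0.12134779569343124
"""


def read_projection(handle):
    """Return {barcode: [coordinate, ...]} from a projection CSV file object."""
    reader = csv.reader(handle)
    header = next(reader)  # "Barcode", then one column per component
    n_components = len(header) - 1
    projection = {}
    for row in reader:
        coords = [float(x) for x in row[1:]]
        if len(coords) != n_components:
            raise ValueError(f"unexpected column count for {row[0]}")
        projection[row[0]] = coords
    return projection


projection = read_projection(io.StringIO(SAMPLE))
print(projection["AAACAGCCAAGCTAAA-1"])
```

For a real run, pass an open file such as `open("analysis/dimensionality_reduction/gex/pca_projection.csv", newline="")` instead of the in-memory sample.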

The second file is a components matrix which indicates how much each feature contributed (the loadings) to each principal component. Features that were not included in the PCA analysis have all of their loading values set to zero.

    $ head -2 analysis/dimensionality_reduction/gex/pca_components.csv
    PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
    1,-0.0044,0.0039,-0.0024,-0.0016,...,-0.0104

    $ head -2 analysis/dimensionality_reduction/atac/lsa_components.csv
    PC,chr1:9695143-9697582,chr1:9698212-9701041,...
    1,-0.5482991923678618,-0.6374211593177428,...
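One common use of the components matrix is to rank features by the magnitude of their loading on a given component. The sketch below assumes the layout shown above (one row per component, one column per feature) and uses only the first four GEX features from the listing; the function name `top_loadings` is hypothetical.

```python
import csv
import io

# First four feature columns of gex/pca_components.csv from the listing above
# (the elided columns are left out).
SAMPLE = """PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880
1,-0.0044,0.0039,-0.0024,-0.0016
"""


def top_loadings(handle, pc, k):
    """Return the k (feature, loading) pairs with the largest |loading| on component pc."""
    reader = csv.reader(handle)
    features = next(reader)[1:]  # drop the leading "PC" column
    for row in reader:
        if int(row[0]) == pc:
            loadings = [float(x) for x in row[1:]]
            ranked = sorted(
                zip(features, loadings),
                key=lambda pair: abs(pair[1]),
                reverse=True,
            )
            return ranked[:k]
    raise KeyError(f"component {pc} not found")


top = top_loadings(io.StringIO(SAMPLE), pc=1, k=2)
print(top)
```

Ranking by absolute value treats strongly negative and strongly positive loadings alike, which is usually what is wanted when asking which features drive a component.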

The third file records the proportion of total variance explained by each principal component. When choosing the number of principal components that are significant, it is useful to look at the plot of variance explained as a function of PC rank - when the numbers start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.

    $ head -5 analysis/dimensionality_reduction/gex/pca_variance.csv
    PC,Proportion.Variance.Explained
    1,0.01009176733941617
    2,0.0031696809558130652
    3,0.002391878968412864
    4,0.0020683529204892654

    $ head -5 analysis/dimensionality_reduction/atac/lsa_variance.csv
    PC,Proportion.Variance.Explained
    1,0.2210095789548742
    2,0.03476394600838236
    3,0.005925095778867349
    4,0.003582945659343343
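To see where the variance curve flattens without plotting it, one can tabulate the cumulative proportion and the drop between consecutive components. This is a minimal sketch using the four ATAC rows shown above; the function name `variance_profile` is hypothetical, and deciding how small a drop counts as "flat" is left to the reader.

```python
import csv
import io
from itertools import accumulate

# First four rows of atac/lsa_variance.csv from the listing above.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.2210095789548742
2,0.03476394600838236
3,0.005925095778867349
4,0.003582945659343343
"""


def variance_profile(handle):
    """Return (proportions, cumulative proportions, drops between consecutive PCs)."""
    rows = list(csv.DictReader(handle))
    props = [float(r["Proportion.Variance.Explained"]) for r in rows]
    cumulative = list(accumulate(props))
    drops = [a - b for a, b in zip(props, props[1:])]
    return props, cumulative, drops


props, cumulative, drops = variance_profile(io.StringIO(SAMPLE))
for rank, (p, c) in enumerate(zip(props, cumulative), start=1):
    print(f"PC {rank}: proportion={p:.4f} cumulative={c:.4f}")
```

In this sample the drop from the first to the second component is far larger than any later drop, which is the flattening pattern the text describes.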

The final file lists the normalized dispersion of each feature, after binning features by their mean expression across the dataset. This provides a useful measure of variability of each feature.

    $ head -5 analysis/dimensionality_reduction/gex/pca_dispersion.csv
    Feature,Normalized.Dispersion
    ENSG00000228327,2.0138970131886671
    ENSG00000237491,1.3773662040549017
    ENSG00000177757,-0.28102027567224191
    ENSG00000225880,1.9887312950109921

    $ head -5 analysis/dimensionality_reduction/atac/lsa_dispersion.csv
    Feature,Normalized.Dispersion
    chr1:9695143-9697582,0.02029960904777695
    chr1:9698212-9701041,0.10379770925583033
    chr1:9825253-9827762,-1.0
    chr1:9829746-9830116,25.528012093307737

After running PCA or LSA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize cells in a 2-D space.

    $ head -5 analysis/dimensionality_reduction/atac/tsne_projection.csv
    Barcode,TSNE-1,TSNE-2
    AAACAGCCAAGCTAAA-1,9.2136315704327,-5.795182388646322
    AAACAGCCAAGGTAAC-1,-10.9596148671472,17.742914355441265
    AAACAGCCAGTAGGTG-1,2.869065977385947,-17.55872285065259
    AAACAGCCATAATGTC-1,12.495664530357228,1.9561615760448785

    $ head -5 analysis/dimensionality_reduction/gex/tsne_projection.csv
    Barcode,TSNE-1,TSNE-2
    AAACAGCCAAGCTAAA-1,11.19783100504234,-32.672655215753544
    AAACAGCCAAGGTAAC-1,-24.09848935339985,0.6469769415490979
    AAACAGCCAGTAGGTG-1,4.678939739563926,-27.328395716680745
    AAACAGCCATAATGTC-1,21.49243070779123,-27.122233774496824

After running PCA or LSA, Uniform Manifold Approximation and Projection (UMAP) is run to visualize cells in a 2-D space.

    $ head -5 analysis/dimensionality_reduction/atac/umap_projection.csv
    Barcode,UMAP-1,UMAP-2
    AAACAGCCAAGCTAAA-1,7.3394675,-5.621648
    AAACAGCCAAGGTAAC-1,-7.112387,0.9921901
    AAACAGCCAGTAGGTG-1,4.2560987,-6.852242
    AAACAGCCATAATGTC-1,6.77568,-5.4857235

    $ head -5 analysis/dimensionality_reduction/gex/umap_projection.csv
    Barcode,UMAP-1,UMAP-2
    AAACAGCCAAGCTAAA-1,7.312935,-7.3619266
    AAACAGCCAAGGTAAC-1,-8.567425,-1.38729
    AAACAGCCAGTAGGTG-1,8.221492,-6.5541673
    AAACAGCCATAATGTC-1,5.5689363,-7.2709103

The ATAC and GEX data per cell barcode is sparse and clustering the data using the large number of features can help discover different cell populations in the sample. Moreover, the clustering helps us detect differentially accessible peaks or differentially expressed genes in each population.

Clustering is then run to group cells together that have similar expression profiles, based on their
projection into PCA space (GEX) or LSA space (ATAC).
Graph-based clustering (under `graphclust`) is run once, as it does not require a pre-specified
number of clusters. K-means (under `kmeans_K_clusters`) is run for each K=2,...,N, where K
is the number of clusters (default N=5).

    $ ls analysis/clustering/atac
    graphclust  kmeans_2_clusters  kmeans_3_clusters  kmeans_4_clusters  kmeans_5_clusters

For each clustering, `cellranger-arc` produces cluster assignments for each cell.

    $ head -5 analysis/clustering/atac/kmeans_3_clusters/clusters.csv
    Barcode,Cluster
    AAACATACAACGAA-1,2
    AAACATACTACGCA-1,2
    AAACCGTGTCTCGC-1,1
    AAACGCACAACCAC-1,3

    $ head -5 analysis/clustering/gex/kmeans_3_clusters/clusters.csv
    Barcode,Cluster
    AAACATACAACGAA-1,2
    AAACATACTACGCA-1,2
    AAACCGTGTCTCGC-1,1
    AAACGCACAACCAC-1,3
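A quick sanity check on any clustering is the number of cells per cluster. The following sketch tallies cluster sizes from a `clusters.csv` file object, using the four sample rows shown above; the helper name `cluster_sizes` is hypothetical.

```python
import csv
import io
from collections import Counter

# Sample rows of kmeans_3_clusters/clusters.csv from the listing above.
SAMPLE = """Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
"""


def cluster_sizes(handle):
    """Count barcodes per cluster label in a clusters.csv file object."""
    return Counter(int(row["Cluster"]) for row in csv.DictReader(handle))


sizes = cluster_sizes(io.StringIO(SAMPLE))
for cluster, n_cells in sorted(sizes.items()):
    print(f"cluster {cluster}: {n_cells} cell(s)")
```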

For each clustering setting generated for either ATAC or GEX matrix and by either K-means or graph
clustering method, `cellranger-arc` then produces a table indicating which
genes are differentially expressed (`differential_expression.csv`) and a table
indicating which peaks and transcription factor motifs are differentially accessible
(`differential_accessibility.csv`) in each cluster relative to all other clusters, as per
the algorithms described here.
For each feature, whether it is a gene, peak, or transcription factor motif, we compute these three values per cluster:

- The mean UMI counts per cell of this feature in cluster *i*
- The log2 fold-change of this feature's expression in cluster *i* relative to all other clusters
- The p-value denoting the significance of this feature's expression in cluster *i* relative to other clusters, adjusted to account for the number of hypotheses (i.e. the number of features) being tested

Both `differential_expression.csv` and `differential_accessibility.csv` are located in the same directory as the clustering results.

    $ head -5 analysis/clustering/atac/graphclust/differential_expression.csv
    Feature ID,Feature Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
    ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,0.00071455453829430329,-2.3725403666493312,0.0043023680184636837
    ENSG00000237491,RP11-206L10.9,0.00012635330969630726,-0.31783275717885928,0.40959138980118809,0.0,3.8319652342760779,0.11986963938734894,0.0,0.56605908868652577,0.39910771338768203
    ENSG00000177757,FAM87B,0.0,-2.9027952579000154,0.0,0.0,3.2470027335549219,0.19129034227967889,0.00071455453829430329,3.1510215894076818,0.0
    ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,0.00059189960914085744

    $ head -5 analysis/clustering/atac/graphclust/differential_accessibility.csv
    Feature ID,Feature Name,Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
    chr1:9695129-9697582,chr1:9695129-9697582,0.014098403818774368,-5.823451487250574,2.2659671842098193e-06,4.185745651762137e-09,-1.3874516676069444,0.5918812904596457,1.9512762483589925,7.238430090771634,5.00258305609651e-09
    chr1:9698210-9701041,chr1:9698210-9701041,0.013761153212430422,-6.1502095503083165,7.855686702156565e-07,0.046489553517204636,-3.0232327143356246,0.01647646310191049,2.2844378973176838,6.5025499776936115,4.703658999567952e-13
    .
    .
    .
    AHR_HUMAN.H11MO.0.B,AHR_HUMAN.H11MO.0.B,1.5229979744677225e-09,-0.558490289359965,1.0,1.5229979744575502e-09,1.41990325445066,1.0,1.5229979744838465e-09,2.5360529002402097,1.0
    AIRE_HUMAN.H11MO.0.C,AIRE_HUMAN.H11MO.0.C,382.4895824324451,-1.366896997726535,0.007214824200990991,4098.191143669588,0.031632664734601475,1.0,124.22927255017468,2.136369782757689,0.0015585067057439586

Notice that the table `differential_accessibility.csv` for any specific clustering includes differential analysis results for both peaks and transcription factor motifs.
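Because the differential tables are wide (three columns per cluster), it can help to reshape them into per-cluster lists of enriched features. The sketch below does this for two of the `differential_expression.csv` rows shown above. The thresholds (`max_padj=0.05`, `min_log2fc=1.0`) and the function name `enriched_features` are illustrative assumptions, not Cell Ranger ARC defaults.

```python
import csv
import io
import re

# Header and two rows of differential_expression.csv from the listing above.
SAMPLE = """Feature ID,Feature Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,0.00071455453829430329,-2.3725403666493312,0.0043023680184636837
ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,0.00059189960914085744
"""


def enriched_features(handle, max_padj=0.05, min_log2fc=1.0):
    """Return {cluster: [feature name, ...]} for features enriched in each cluster.

    The thresholds are illustrative assumptions chosen for this example.
    """
    reader = csv.DictReader(handle)
    # Discover cluster numbers from the "Cluster <i> Log2 fold change" columns.
    clusters = sorted(
        int(m.group(1))
        for name in reader.fieldnames
        if (m := re.fullmatch(r"Cluster (\d+) Log2 fold change", name))
    )
    hits = {c: [] for c in clusters}
    for row in reader:
        for c in clusters:
            log2fc = float(row[f"Cluster {c} Log2 fold change"])
            padj = float(row[f"Cluster {c} Adjusted p value"])
            if padj <= max_padj and log2fc >= min_log2fc:
                hits[c].append(row["Feature Name"])
    return hits


hits = enriched_features(io.StringIO(SAMPLE))
print(hits)
```

The same function should work on `differential_accessibility.csv`, since it keys only on the fold-change and adjusted p-value columns, whose names are identical in both tables.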

The `feature_linkage.bedpe` file in `outs/analysis/feature_linkage` is a tab-delimited file describing the feature linkages inferred by the pipeline. It follows the BEDPE specification from bedtools and can be loaded directly into the Integrative Genomics Viewer (IGV).
See the Feature Linkage Algorithm page for details on how Cell Ranger ARC produces feature linkages.

    $ head -5 analysis/feature_linkage/feature_linkage.bedpe
    chr1 817064 817593 chr1 998050 998051 <FAM87B_promoter><AL645608.7> 0.3074 . . 7.3085 180722 peak-gene
    chr1 906622 907202 chr1 998050 998051 <AL645608.6_distal><AL645608.7> 0.3544 . . 6.1586 91138 peak-gene
    chr1 817064 817593 chr1 999980 1000172 <FAM87B_promoter><HES4> 0.4095 . . 13.1158 182747 peak-gene
    chr1 906622 907202 chr1 999980 1000172 <AL645608.6_distal><HES4> 0.4341 . . 15.8455 93164 peak-gene

The columns are defined as follows:

Column Number | Name | Description |
---|---|---|
1 | chrom1 | The name of the chromosome on which the first end of the feature exists. |
2 | start1 | The zero-based starting position of the first end of the feature on chrom1. |
3 | end1 | The zero-based ending position of the first end of the feature on chrom1. |
4 | chrom2 | The name of the chromosome on which the second end of the feature exists. |
5 | start2 | The zero-based starting position of the second end of the feature on chrom2. |
6 | end2 | The zero-based ending position of the second end of the feature on chrom2. |
7 | name | The name of the linkage, in the format `<name1><name2>`, where name1 and name2 are based on gene symbol or peak annotation. |
8 | score | Linkage correlation, ranging from -1 to 1. |
9 | strand1 | Set to ".". |
10 | strand2 | Set to ".". |
11 | significance | Linkage significance: -log10 p-value after multiple testing correction (false discovery rate). Capped at 299. |
12 | distance | Distance in base pairs from feature 2 to feature 1. |
13 | linkage_type | Can be "peak-peak", "gene-peak" or "peak-gene", depending on whether feature 1 and feature 2 are genes or peaks. |

The distance between features in a feature linkage is defined as follows:

- For linkages between a gene and a peak: the number of base pairs between the transcription start site (TSS) and the center of the peak. When a gene has multiple TSSs, the TSS position is defined as the center between the leftmost and rightmost TSS.
- For linkages between two peaks: the base pairs between the centers of the two peaks.

Note that linkage distance can be positive or negative. Positive distance means the genomic coordinates are larger in **feature 2** than in **feature 1**. Because of the symmetric nature of feature linkage, only linkages with positive or zero distances are output to `feature_linkage.bedpe`.
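The column definitions and the distance rule can be checked against the sample rows. The sketch below parses one BEDPE line into named fields and recomputes the distance as the difference between interval centers. The names `Linkage`, `parse_linkage`, and `center_distance` are hypothetical, and rounding down to an integer is an assumption chosen because it reproduces the four sample rows shown earlier; the pipeline's actual rounding rule is not stated in this document.

```python
from typing import NamedTuple


class Linkage(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    score: float
    strand1: str
    strand2: str
    significance: float
    distance: int
    linkage_type: str


def parse_linkage(line):
    """Parse one tab-delimited line of feature_linkage.bedpe."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 13:
        raise ValueError(f"expected 13 columns, got {len(f)}")
    return Linkage(
        f[0], int(f[1]), int(f[2]),
        f[3], int(f[4]), int(f[5]),
        f[6], float(f[7]), f[8], f[9],
        float(f[10]), int(f[11]), f[12],
    )


def center_distance(link):
    """Center of feature 2 minus center of feature 1, rounded down (assumed rounding)."""
    return ((link.start2 + link.end2) - (link.start1 + link.end1)) // 2


# First and third sample rows from the listing above, with tabs restored.
ROW_1 = "chr1\t817064\t817593\tchr1\t998050\t998051\t<FAM87B_promoter><AL645608.7>\t0.3074\t.\t.\t7.3085\t180722\tpeak-gene"
ROW_3 = "chr1\t817064\t817593\tchr1\t999980\t1000172\t<FAM87B_promoter><HES4>\t0.4095\t.\t.\t13.1158\t182747\tpeak-gene"

for row in (ROW_1, ROW_3):
    link = parse_linkage(row)
    print(link.name, link.distance, center_distance(link))
```

For the gene ends in these rows, the interval in columns 5 and 6 appears to encode the TSS position (or the span of multiple TSSs), so its center matches the TSS definition given above.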

The `feature_linkage_matrix.h5` file is a compressed HDF5 file containing the sparse matrices of feature linkage correlation and significance, as well as the feature references. The file hierarchy is as follows:

    (root)
    ├── score
    ├── significance
    ├── indices
    ├── indptr
    └── features [HDF5 group]
        ├─ _all_tag_keys
        ├─ feature_type
        ├─ genome
        ├─ id
        ├─ interval
        └─ name

and the member specifications are as follows:

Column | Type | Description |
---|---|---|
`score` | float64 | Linkage correlation, ranging from -1 to 1. |
`significance` | float64 | Linkage significance: -log10 p-value after multiple testing correction (false discovery rate). Capped at 299. |
`indices` | int64 | CSR format index array of the matrix. |
`indptr` | int64 | CSR format index pointer array of the matrix. |
`feature_type` | string | The type of feature reference to which this feature belongs (Gene Expression or Peaks). |
`genome` | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). For non-gene expression features, this entry is an empty string. |
`id` | string | The unique id corresponding to this feature (Ensembl gene IDs for genes or peak coordinates for peaks). |
`interval` | string | Specifies TSS coordinates for genes, or peak coordinates for peaks. |
`name` | string | A human-readable name associated with this feature (gene symbol for gene features and peak coordinates for peak features). |

The HDF5 group `features` contains information regarding the feature reference(s) used for the analysis. The datasets within the `features` group represent columns in a table containing one row per feature. Values in the `feature_idx` column described in the previous section provide indices into the rows of this hypothetical table.

The linkage correlation and linkage significance matrices are n_feature x n_feature sparse matrices sharing the same sparsity pattern, which is defined by `indices` and `indptr`.
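To illustrate how a shared CSR sparsity pattern works, the sketch below decodes two value arrays with one pair of `indices` and `indptr` arrays. The arrays are a made-up three-feature toy example, not real Cell Ranger ARC output, and the helper names are hypothetical. Reading the actual datasets from the HDF5 file would require an HDF5 library, which is left out here.

```python
# Toy CSR arrays for a 3 x 3 matrix (made-up values, NOT pipeline output).
# Row i owns the stored entries k with indptr[i] <= k < indptr[i + 1],
# so indptr always has n_feature + 1 elements.
indptr = [0, 1, 3, 4]
indices = [1, 0, 2, 1]                  # column index of each stored entry
score = [0.31, 0.31, -0.12, -0.12]      # linkage correlation per stored entry
significance = [7.3, 7.3, 2.1, 2.1]     # linkage significance per stored entry


def row_entries(i, indptr, indices, *values):
    """Return [(column, value_a, value_b, ...)] for row i of the shared pattern."""
    start, stop = indptr[i], indptr[i + 1]
    return [(indices[k],) + tuple(v[k] for v in values) for k in range(start, stop)]


def to_dense(n, indptr, indices, values, fill=0.0):
    """Expand one CSR value array into an n x n list of lists."""
    dense = [[fill] * n for _ in range(n)]
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            dense[i][indices[k]] = values[k]
    return dense


# Feature 1 is linked to features 0 and 2 in this toy matrix.
print(row_entries(1, indptr, indices, score, significance))
print(to_dense(3, indptr, indices, score))
```

Because both matrices reuse the same `indices` and `indptr`, entry `k` of `score` and entry `k` of `significance` always refer to the same pair of features. The row and column numbers index into the per-feature datasets of the `features` group (`id`, `name`, and so on).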

Cell Ranger ARC performs a motif scan on peaks and generates a motif-barcode
matrix. The output files are located at `analysis/tf_analysis`, including

- HDF5 `filtered_tf_bc_matrix.h5` and MEX `filtered_tf_bc_matrix`, following the same format as the joint feature-barcode matrix
- Peak-motif occurrence mappings BED `peak_motif_mapping.bed`

    tf_analysis
    ├── filtered_tf_bc_matrix
    │   ├── barcodes.tsv.gz
    │   ├── matrix.mtx.gz
    │   └── motifs.tsv
    ├── filtered_tf_bc_matrix.h5
    └── peak_motif_mapping.bed

The `peak_motif_mapping.bed` file is a BED file containing **peak** coordinates and motif
names as the fourth column. Each row represents the occurrence of one motif in one peak as evidenced by the motif scan; a single peak can occur multiple times associated with different motifs.

    $ head -5 analysis/tf_analysis/peak_motif_mapping.bed
    chr1 629732 630166 MAFG::NFE2L1_MA0089.1
    chr1 629732 630166 Sox5_MA0087.1
    chr1 633796 634260 SHOX_MA0630.1
    chr1 633796 634260 VAX2_MA0723.1
    chr1 633796 634260 Sox5_MA0087.1
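Since each row pairs one peak with one motif, grouping rows by peak recovers the full motif list for each peak. This is a minimal sketch using the five sample rows above; it assumes the file is tab-delimited, as is standard for BED, and the function name `motifs_by_peak` is hypothetical.

```python
from collections import defaultdict

# The five sample rows of peak_motif_mapping.bed, with tabs restored.
SAMPLE = """chr1\t629732\t630166\tMAFG::NFE2L1_MA0089.1
chr1\t629732\t630166\tSox5_MA0087.1
chr1\t633796\t634260\tSHOX_MA0630.1
chr1\t633796\t634260\tVAX2_MA0723.1
chr1\t633796\t634260\tSox5_MA0087.1
"""


def motifs_by_peak(lines):
    """Group motif names by (chrom, start, end) peak coordinates."""
    peaks = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, start, end, motif = line.split("\t")[:4]
        peaks[(chrom, int(start), int(end))].append(motif)
    return dict(peaks)


peak_motifs = motifs_by_peak(SAMPLE.splitlines())
for peak, motifs in peak_motifs.items():
    print(peak, motifs)
```

Swapping the key and value gives the reverse mapping, from each motif to the peaks in which it occurs.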
