
# Aggregating Multiple GEM Groups with cellranger-atac aggr

When doing large studies involving multiple GEM wells, run cellranger-atac count on FASTQ data from each of the GEM wells individually, and then pool the results using cellranger-atac aggr, as described here.

The cellranger-atac aggr command takes a CSV file that lists the cellranger-atac count output files (specifically the fragments.tsv.gz and singlecell.csv from each run) and produces a single peak-barcode matrix containing all the data.

When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence (see GEM wells).

By default, the fragments from each GEM well are subsampled such that all GEM wells have the same effective depth, measured in terms of median unique fragments per cell. However, it is possible to change the normalization mode (see section on equalizing sensitivity).

## Requirements

The first step is to run cellranger-atac count on each individual GEM well prepared using the 10x Chromium™ platform, as described in Single-GEM Well Analysis.

For example, suppose you ran three count pipelines as follows:

$ cd /opt/runs
$ cellranger-atac count --id=LV123 ...
... wait for pipeline to finish ...
$ cellranger-atac count --id=LB456 ...
... wait for pipeline to finish ...
$ cellranger-atac count --id=LP789 ...
... wait for pipeline to finish ...

Now you can aggregate these three runs to get an aggregated matrix and analysis. In order to do so, you need to create an Aggregation CSV.

## Setting Up An Aggregation CSV

Create a CSV file with a header line containing the following columns:

• library_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it doesn't need to match any previous ID you've assigned to the GEM well.
• fragments: Path to the fragments.tsv.gz file produced by cellranger-atac count. For example, if you processed your GEM well by calling cellranger-atac count --id=ID in some directory /DIR, the fragments would be /DIR/ID/outs/fragments.tsv.gz.
• cells: Path to the singlecell.csv file produced by cellranger-atac count.
• (Optional) peaks: Path to the peaks.bed file produced by cellranger-atac count.
• (Optional) Additional custom columns containing library meta-data (e.g., lab or sample origin). These custom library annotations do not affect the analysis pipeline but can be visualized downstream in the Loupe Cell Browser. Note that unlike other CSV inputs to Cell Ranger ATAC, these custom columns may contain characters outside the ASCII range (e.g., non-Latin characters).

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, your Excel spreadsheet would look like this:

|   | A | B | C |
|---|---|---|---|
| 1 | library_id | fragments | cells |
| 2 | LV123 | /opt/runs/LV123/outs/fragments.tsv.gz | /opt/runs/LV123/outs/singlecell.csv |
| 3 | LB456 | /opt/runs/LB456/outs/fragments.tsv.gz | /opt/runs/LB456/outs/singlecell.csv |
| 4 | LP789 | /opt/runs/LP789/outs/fragments.tsv.gz | /opt/runs/LP789/outs/singlecell.csv |

When you save it as a CSV, the result would look like this:

library_id,fragments,cells
LV123,/opt/runs/LV123/outs/fragments.tsv.gz,/opt/runs/LV123/outs/singlecell.csv
LB456,/opt/runs/LB456/outs/fragments.tsv.gz,/opt/runs/LB456/outs/singlecell.csv
LP789,/opt/runs/LP789/outs/fragments.tsv.gz,/opt/runs/LP789/outs/singlecell.csv
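For more than a handful of runs, it may be easier to generate the CSV than to type it. The following is a minimal sketch in Python, assuming every run sits under one common directory and follows the outs/ layout described above. The helper names `build_aggregation_rows` and `rows_to_csv` are hypothetical and are not part of Cell Ranger ATAC.

```python
import csv
import io
from pathlib import PurePosixPath


def build_aggregation_rows(runs_dir, run_ids):
    """Return one row per run, pointing at <runs_dir>/<id>/outs (assumed layout)."""
    rows = []
    for run_id in run_ids:
        outs = PurePosixPath(runs_dir) / run_id / "outs"
        rows.append({
            "library_id": run_id,
            "fragments": str(outs / "fragments.tsv.gz"),
            "cells": str(outs / "singlecell.csv"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows with the required header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["library_id", "fragments", "cells"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = rows_to_csv(
    build_aggregation_rows("/opt/runs", ["LV123", "LB456", "LP789"])
)
print(csv_text, end="")
```

Writing `csv_text` to a file such as AGG123_libraries.csv should give the same content as the hand-made CSV above. The row order matters, because it determines the GEM well numbering in the aggregated output.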


## Command Line Interface

These are the most common command line arguments (run cellranger-atac aggr --help for a full list):

| Argument | Description |
|---|---|
| --id=ID | A unique run ID string: e.g. AGG123 |
| --csv=CSV | Path of a CSV file containing a list of cellranger-atac count outputs (see Setting up a CSV). |
| --reference=PATH | Path to a Cell Ranger ATAC reference. |
| --normalize=MODE | (Optional) String specifying how to normalize the input libraries. Valid values: depth (default), signal, or none (see Equalize Sensitivity). |
| --nosecondary | (Optional) Add this flag to skip secondary analysis, which includes dimensionality reduction, clustering and visualization. This is applicable if you plan to use cellranger-atac reanalyze or your own custom analysis. |
| --dim-reduce=MODE | (Optional) Dimensionality reduction mode for clustering. Valid values: lsa (default), pca, or plsa. |

After specifying these input arguments, run cellranger-atac aggr:

$ cd /home/jdoe/runs
$ cellranger-atac aggr --id=AGG123 \
--csv=AGG123_libraries.csv \
--normalize=depth \
--reference=/home/jdoe/refs/hg19


The pipeline will begin to run, creating a new folder named with the aggregation ID you specified (e.g. /home/jdoe/runs/AGG123) for its output. If this folder already exists, cellranger-atac will assume it is an existing pipestance and attempt to resume running it.
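When aggr is launched from a script, it can help to check the option values before starting a long run. The sketch below assembles the command line from the arguments listed above and rejects unknown modes. The function `aggr_command` is a hypothetical helper, not part of Cell Ranger ATAC, and it only builds the argument list without running anything.

```python
import shlex

# Valid values as listed in the argument table above.
NORMALIZE_MODES = {"depth", "signal", "none"}
DIM_REDUCE_MODES = {"lsa", "pca", "plsa"}


def aggr_command(run_id, csv_path, reference,
                 normalize="depth", dim_reduce=None, nosecondary=False):
    """Build the argv list for cellranger-atac aggr (hypothetical helper)."""
    if normalize not in NORMALIZE_MODES:
        raise ValueError(f"unknown normalize mode: {normalize!r}")
    if dim_reduce is not None and dim_reduce not in DIM_REDUCE_MODES:
        raise ValueError(f"unknown dim-reduce mode: {dim_reduce!r}")
    argv = [
        "cellranger-atac", "aggr",
        f"--id={run_id}",
        f"--csv={csv_path}",
        f"--normalize={normalize}",
        f"--reference={reference}",
    ]
    if dim_reduce is not None:
        argv.append(f"--dim-reduce={dim_reduce}")
    if nosecondary:
        argv.append("--nosecondary")
    return argv


argv = aggr_command("AGG123", "AGG123_libraries.csv", "/home/jdoe/refs/hg19")
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, cwd="/home/jdoe/runs")`. Because an existing output folder is treated as a pipestance to resume, a script may also want to check whether the folder named by the ID already exists before launching.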

## Pipeline Outputs

The cellranger-atac aggr pipeline generates output files that contain all of the data from the individual input jobs, aggregated into single output files, for convenient multi-sample analysis. The GEM well suffix of each barcode is updated to prevent barcode collisions, as described below.

Each output file produced by cellranger-atac aggr follows the format described in the Understanding Output section of the documentation, but includes the union of all the relevant barcodes from each of the input jobs.

A successful run should conclude with a message similar to this:

2019-03-21 10:14:34 [runtime] (run:hydra)       ID.AGG123.SC_ATAC_AGGREGATOR_CS.CLOUPE_PREPROCESS.fork0.join
2019-03-21 10:14:40 [runtime] (join_complete)   ID.AGG123.SC_ATAC_AGGREGATOR_CS.CLOUPE_PREPROCESS
2019-03-21 10:14:40 [runtime] VDR killed 281 files, 42 MB.

Outputs:
- Barcoded and aligned fragment file:           /home/jdoe/runs/AGG123/outs/fragments.tsv.gz
- Fragment file index:                          /home/jdoe/runs/AGG123/outs/fragments.tsv.gz.tbi
- Per-barcode fragment counts & metrics:        /home/jdoe/runs/AGG123/outs/singlecell.csv
- Bed file of all called peak locations:        /home/jdoe/runs/AGG123/outs/peaks.bed
- Filtered peak barcode matrix in hdf5 format:  /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix.h5
- Filtered peak barcode matrix in mex format:   /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix
- Directory of analysis files:                  /home/jdoe/runs/AGG123/outs/analysis
- HTML file summarizing aggregation analysis :  /home/jdoe/runs/AGG123/outs/web_summary.html
- Filtered tf barcode matrix in hdf5 format:    /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix.h5
- Filtered tf barcode matrix in mex format:     /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix
- Loupe Cell Browser input file:                /home/jdoe/runs/AGG123/outs/cloupe.cloupe
- csv summarizing important metrics and values: /home/jdoe/runs/AGG123/outs/summary.csv
- Summary of all data metrics:                  /home/jdoe/runs/AGG123/outs/summary.json
- Annotation of peaks with genes:               /home/jdoe/runs/AGG123/outs/peak_annotation.tsv
- Csv of aggregation of libraries:              /home/jdoe/runs/AGG123/outs/aggregation_csv.csv

Pipestance completed successfully!


Once cellranger-atac aggr has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Cell Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the cellranger-atac aggr section of the Summary Metrics page.

## Understanding GEM Wells

Each GEM well is a physically distinct set of GEM partitions, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, we append a small integer identifying the GEM well to the barcode nucleotide sequence, and use that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.

This number, which tells us which GEM well this barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells reflects the order in which the GEM wells were provided in the Aggregation CSV.
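In downstream analysis it is often useful to map an aggregated barcode back to its library. The sketch below splits the suffix from the nucleotide sequence and looks up the library_id by position. It assumes that suffix 1 corresponds to the first row of the Aggregation CSV; the function names are hypothetical.

```python
def split_barcode(barcode):
    """Split 'AGACCATTGAGACTTA-2' into ('AGACCATTGAGACTTA', 2)."""
    sequence, _, suffix = barcode.rpartition("-")
    if not sequence or not suffix.isdigit():
        raise ValueError(f"barcode has no GEM well suffix: {barcode!r}")
    return sequence, int(suffix)


def library_for_barcode(barcode, library_ids):
    """Return the library_id for a barcode.

    Assumption: suffixes start at 1 and follow the row order
    of the Aggregation CSV.
    """
    _, gem_well = split_barcode(barcode)
    if not 1 <= gem_well <= len(library_ids):
        raise ValueError(f"GEM well suffix out of range: {gem_well}")
    return library_ids[gem_well - 1]


# library_id column of the example Aggregation CSV, in row order.
library_ids = ["LV123", "LB456", "LP789"]

for bc in ["AGACCATTGAGACTTA-1", "AGACCATTGAGACTTA-2"]:
    print(bc, library_for_barcode(bc, library_ids))
```

Under that assumption, the two example barcodes resolve to LV123 and LB456 even though they share the same nucleotide sequence.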

## Depth Normalization: equalize sensitivity

When combining data from multiple GEM wells, the cellranger-atac aggr pipeline automatically equalizes the sensitivity of the GEM wells before merging. This is the recommended approach because it avoids the batch effect introduced by differences in sequencing depth. You can turn off normalization or change how it is done with the --normalize option. The none mode may be appropriate if you want to maximize the sensitivity of the input libraries and plan to handle normalization in a downstream step.

There are three normalization modes:

• depth: (default) Subsample fragments from higher-depth GEM wells until they all have an equal number of unique fragments per cell.
• none: Do not normalize at all.
• signal: Subsample fragments from GEM wells such that each GEM well library has the same distribution of enriched cut sites along the genome. Read the algorithms section on aggregation for more details.
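To make the idea behind depth mode concrete, here is a simplified illustration in Python. It is not the pipeline's actual implementation: it assumes depth is the median unique fragments per cell, takes the shallowest library as the target, and keeps each fragment independently with a fixed probability. The function names and the per-cell counts are invented for the example.

```python
import random
from statistics import median


def subsample_rates(fragments_per_cell):
    """Map library_id -> fraction of fragments to keep.

    Each library is scaled so that its median unique fragments per
    cell matches the shallowest library (illustrative assumption).
    """
    depths = {lib: median(counts) for lib, counts in fragments_per_cell.items()}
    target = min(depths.values())
    if target <= 0:
        raise ValueError("every library needs a positive median depth")
    return {lib: target / depth for lib, depth in depths.items()}


def subsample(fragments, rate, seed=0):
    """Keep each fragment independently with probability `rate`."""
    rng = random.Random(seed)
    return [frag for frag in fragments if rng.random() < rate]


# Invented unique-fragment counts per cell for three libraries.
fragments_per_cell = {
    "LV123": [8000, 10000, 12000],
    "LB456": [4000, 5000, 6000],
    "LP789": [18000, 20000, 22000],
}
rates = subsample_rates(fragments_per_cell)
print(rates)  # {'LV123': 0.5, 'LB456': 1.0, 'LP789': 0.25}
```

In this toy example the shallowest library (LB456) is left untouched, while the deeper ones are thinned to the same median depth. With none, every rate would effectively be 1.0.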