Cell Ranger ATAC 1.2, printed on 12/10/2023
When doing large studies involving multiple GEM wells, run cellranger-atac count on FASTQ data from each of the GEM wells individually, and then pool the results using cellranger-atac aggr, as described here.
|cellranger-atac aggr is not designed for combining multiple sequencing runs of the same GEM well. For that, pass the list of FASTQ files from the multiple sequencing runs of that GEM well to the --fastqs argument of cellranger-atac count.|
The cellranger-atac aggr command takes a CSV file specifying a list of cellranger-atac count output files (specifically the fragments.tsv.gz and singlecell.csv from each run), and produces a single peak-barcode matrix containing all the data.
When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence (see GEM wells).
By default, the fragments from each GEM well are subsampled such that all GEM wells have the same effective depth, measured in terms of median unique fragments per cell. However, it is possible to change the normalization mode (see section on equalizing sensitivity).
The first step is to run cellranger-atac count on each individual GEM well prepared using the 10x Chromium™ platform, as described in Single-GEM Well Analysis.
For example, suppose you ran three count pipelines as follows:
$ cd /opt/runs
$ cellranger-atac count --id=LV123 ...
... wait for pipeline to finish ...
$ cellranger-atac count --id=LB456 ...
... wait for pipeline to finish ...
$ cellranger-atac count --id=LP789 ...
... wait for pipeline to finish ...
Now you can aggregate these three runs to get an aggregated matrix and analysis. In order to do so, you need to create an Aggregation CSV.
Create a CSV file with a header line containing the following columns:
library_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it doesn't need to match any previous ID you've assigned to the GEM well.
fragments: Path to the fragments.tsv.gz file produced by cellranger-atac count. For example, the LV123 run above, started with cellranger-atac count --id=LV123 in /opt/runs, writes this file to /opt/runs/LV123/outs/fragments.tsv.gz.
cells: Path to the singlecell.csv file produced by cellranger-atac count.
peaks: (Optional) Path to the peaks.bed file produced by cellranger-atac count.
|cellranger-atac aggr allows you to optionally specify a peaks BED file in the peaks column. If you do not specify a peaks file, the pipeline will call peaks on the aggregated data and then run downstream analysis. If you supply peaks files (this must be done for every row), the pipeline will skip internal peak calling and instead merge the specified peaks. Together, the peaks and cells columns let you supply custom peaks and select the cells to be used in the analysis of the aggregated data.|
You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, the saved CSV would look like this:
library_id,fragments,cells
LV123,/opt/runs/LV123/outs/fragments.tsv.gz,/opt/runs/LV123/outs/singlecell.csv
LB456,/opt/runs/LB456/outs/fragments.tsv.gz,/opt/runs/LB456/outs/singlecell.csv
LP789,/opt/runs/LP789/outs/fragments.tsv.gz,/opt/runs/LP789/outs/singlecell.csv
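For studies with many GEM wells, the Aggregation CSV can also be generated by a script. The following is a minimal Python sketch and not part of Cell Ranger ATAC: the helper names `aggregation_rows` and `write_aggregation_csv` are hypothetical, and it assumes every run was started from one common directory so that its files sit under `<runs_dir>/<run_id>/outs/`.

```python
import csv
import io
from pathlib import PurePosixPath


def aggregation_rows(runs_dir, run_ids, with_peaks=False):
    """Build the header plus one row per GEM well, in the order given.

    Hypothetical helper: assumes each run was produced by
    `cellranger-atac count --id=<run_id>` inside `runs_dir`.
    """
    header = ["library_id", "fragments", "cells"]
    if with_peaks:
        # The peaks column is optional, but when used it is filled on every row.
        header.append("peaks")
    rows = [header]
    for run_id in run_ids:
        outs = PurePosixPath(runs_dir) / run_id / "outs"
        row = [run_id, str(outs / "fragments.tsv.gz"), str(outs / "singlecell.csv")]
        if with_peaks:
            row.append(str(outs / "peaks.bed"))
        rows.append(row)
    return rows


def write_aggregation_csv(handle, rows):
    """Write the rows as comma-separated text to an open text handle."""
    csv.writer(handle, lineterminator="\n").writerows(rows)


if __name__ == "__main__":
    buf = io.StringIO()
    write_aggregation_csv(buf, aggregation_rows("/opt/runs", ["LV123", "LB456", "LP789"]))
    print(buf.getvalue(), end="")
```

Because the GEM well suffix follows the row order of this file, the order of `run_ids` determines which suffix each library receives.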
These are the most common command line arguments (run cellranger-atac aggr --help for a full list):
|--id|A unique run ID string, e.g. AGG123.|
|--csv|Path of a CSV file containing a list of cellranger-atac count outputs (see Setting up a CSV).|
|--reference|Path to a Cell Ranger ATAC reference.|
|--normalize|(Optional) String specifying how to normalize the input libraries. Valid values: depth (default), none, or signal.|
|--nosecondary|(Optional) Add this flag to skip secondary analysis, which includes dimensionality reduction, clustering and visualization. This is applicable if you plan to use cellranger-atac reanalyze or your own custom analysis.|
|--dim-reduce|(Optional) Dimensionality reduction mode for clustering. Valid values: lsa (default), plsa, or pca.|
After specifying these input arguments, run cellranger-atac aggr:
$ cd /home/jdoe/runs
$ cellranger-atac aggr --id=AGG123 \
                       --csv=AGG123_libraries.csv \
                       --normalize=depth \
                       --reference=/home/jdoe/refs/hg19
The pipeline will begin to run, creating a new folder named with the aggregation ID you specified (e.g. /home/jdoe/runs/AGG123) for its output. If this folder already exists, cellranger-atac will assume it is an existing pipestance and attempt to resume running it.
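Because an existing output folder is treated as a pipestance to resume, a wrapper script may want to check for it before launching. The sketch below is one way to do that in Python; `build_aggr_command` and `would_resume` are hypothetical helper names, and only the four arguments used in the example invocation are assembled.

```python
import os


def build_aggr_command(run_id, csv_path, reference, normalize="depth"):
    """Return the argv list for the aggr invocation, one element per argument."""
    if normalize not in ("depth", "none", "signal"):
        raise ValueError(f"unknown normalization mode: {normalize}")
    return [
        "cellranger-atac", "aggr",
        f"--id={run_id}",
        f"--csv={csv_path}",
        f"--normalize={normalize}",
        f"--reference={reference}",
    ]


def would_resume(work_dir, run_id):
    """True if <work_dir>/<run_id> already exists, so the run would be resumed."""
    return os.path.isdir(os.path.join(work_dir, run_id))


if __name__ == "__main__":
    work_dir = "/home/jdoe/runs"
    cmd = build_aggr_command("AGG123", "AGG123_libraries.csv", "/home/jdoe/refs/hg19")
    if would_resume(work_dir, "AGG123"):
        print("AGG123 already exists; launching would resume that pipestance")
    print(" ".join(cmd))
    # To launch for real (requires cellranger-atac on PATH):
    #   import subprocess; subprocess.run(cmd, cwd=work_dir, check=True)
```

Passing the command as a list avoids shell quoting problems if a path contains spaces.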
The cellranger-atac aggr pipeline generates output files that contain all of the data from the individual input jobs, aggregated into single output files, for convenient multi-sample analysis. The GEM well suffix of each barcode is updated to prevent barcode collisions, as described below.
Each output file produced by cellranger-atac aggr follows the format described in the Understanding Output section of the documentation, but includes the union of all the relevant barcodes from each input job.
|cellranger-atac aggr does not perform a cell-calling step; it simply aggregates the cell calls encoded in the singlecell.csv of each input job into a final set of cell calls.|
A successful run should conclude with a message similar to this:
2019-03-21 10:14:34 [runtime] (run:hydra) ID.AGG123.SC_ATAC_AGGREGATOR_CS.CLOUPE_PREPROCESS.fork0.join
2019-03-21 10:14:40 [runtime] (join_complete) ID.AGG123.SC_ATAC_AGGREGATOR_CS.CLOUPE_PREPROCESS
2019-03-21 10:14:40 [runtime] VDR killed 281 files, 42 MB.

Outputs:
- Barcoded and aligned fragment file: /home/jdoe/runs/AGG123/outs/fragments.tsv.gz
- Fragment file index: /home/jdoe/runs/AGG123/outs/fragments.tsv.gz.tbi
- Per-barcode fragment counts & metrics: /home/jdoe/runs/AGG123/outs/singlecell.csv
- Bed file of all called peak locations: /home/jdoe/runs/AGG123/outs/peaks.bed
- Filtered peak barcode matrix in hdf5 format: /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix.h5
- Filtered peak barcode matrix in mex format: /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix
- Directory of analysis files: /home/jdoe/runs/AGG123/outs/analysis
- HTML file summarizing aggregation analysis : /home/jdoe/runs/AGG123/outs/web_summary.html
- Filtered tf barcode matrix in hdf5 format: /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix.h5
- Filtered tf barcode matrix in mex format: /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix
- Loupe Browser input file: /home/jdoe/runs/AGG123/outs/cloupe.cloupe
- csv summarizing important metrics and values: /home/jdoe/runs/AGG123/outs/summary.csv
- Summary of all data metrics: /home/jdoe/runs/AGG123/outs/summary.json
- Annotation of peaks with genes: /home/jdoe/runs/AGG123/outs/peak_annotation.tsv
- Csv of aggregation of libraries: /home/jdoe/runs/AGG123/outs/aggregation_csv.csv

Pipestance completed successfully!
Once cellranger-atac aggr has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the cellranger-atac aggr section of the Summary Metrics page.
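Before exploring the data by hand, it can help to confirm that the output folder is complete. The following Python sketch checks for the entries named in the output listing of a successful run; the helper `missing_outputs` is hypothetical, and the list assumes a default run (a run that skips secondary analysis may legitimately lack some entries).

```python
import os

# File and directory names taken from the "Outputs:" listing of a successful run.
EXPECTED_OUTPUTS = (
    "fragments.tsv.gz",
    "fragments.tsv.gz.tbi",
    "singlecell.csv",
    "peaks.bed",
    "filtered_peak_bc_matrix.h5",
    "filtered_peak_bc_matrix",
    "analysis",
    "web_summary.html",
    "filtered_tf_bc_matrix.h5",
    "filtered_tf_bc_matrix",
    "cloupe.cloupe",
    "summary.csv",
    "summary.json",
    "peak_annotation.tsv",
    "aggregation_csv.csv",
)


def missing_outputs(outs_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected entries that are absent from outs_dir, in listing order."""
    return [name for name in expected
            if not os.path.exists(os.path.join(outs_dir, name))]


if __name__ == "__main__":
    missing = missing_outputs("/home/jdoe/runs/AGG123/outs")
    print("all outputs present" if not missing else "missing: " + ", ".join(missing))
```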
Each GEM well is a physically distinct set of GEM partitions, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, we append a small integer identifying the GEM well to the barcode nucleotide sequence, and use that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.
This number, which tells us which GEM well this barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells will reflect the order that the GEM wells were provided in the Aggregation CSV.
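When post-processing the aggregated matrix, the suffix can be used to trace a barcode back to its input library. The Python sketch below illustrates the idea; `split_barcode` and `library_for_barcode` are hypothetical helpers, and the code assumes that suffix numbering starts at 1 for the first row of the Aggregation CSV.

```python
def split_barcode(barcode):
    """Split a suffixed barcode into (nucleotide sequence, GEM well number)."""
    sequence, _, suffix = barcode.rpartition("-")
    if not sequence or not suffix.isdigit():
        raise ValueError(f"barcode has no GEM well suffix: {barcode!r}")
    return sequence, int(suffix)


def library_for_barcode(barcode, library_ids):
    """Map a barcode to its library_id.

    library_ids lists the library_id column in Aggregation CSV order.
    Assumption: suffix N refers to the N-th row, counting from 1.
    """
    _, well = split_barcode(barcode)
    if not 1 <= well <= len(library_ids):
        raise ValueError(f"GEM well suffix {well} is out of range")
    return library_ids[well - 1]


if __name__ == "__main__":
    libraries = ["LV123", "LB456", "LP789"]
    for bc in ("AGACCATTGAGACTTA-1", "AGACCATTGAGACTTA-2"):
        print(bc, library_for_barcode(bc, libraries))
```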
When combining data from multiple GEM wells, the cellranger-atac aggr pipeline automatically equalizes the sensitivity of the wells before merging. This is the recommended approach because it avoids the batch effect introduced by differences in sequencing depth. It is possible to turn off normalization or change the way normalization is done. The none option may be appropriate if you want to maximize sensitivity of the input libraries and plan to deal with normalization in a downstream step.
There are three normalization modes:
depth: (default) Subsample fragments from higher-depth GEM wells until they all have an equal number of unique fragments per cell.
none: Do not normalize at all.
signal: Subsample fragments from GEM wells such that each GEM well library has the same distribution of enriched cut sites along the genome. Read the algorithms section on aggregation for more details.
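To make the depth mode concrete, the Python sketch below computes, for each GEM well, the fraction of fragments to keep so that every well ends up near the lowest median unique fragments per cell. This is a simplified illustration under stated assumptions and not the pipeline's actual implementation: the function names are hypothetical, and uniform random subsampling by a single per-well fraction is an assumption made here for clarity.

```python
import random
from statistics import median


def depth_keep_fractions(fragments_per_cell):
    """Return {library_id: fraction of fragments to keep}.

    fragments_per_cell maps library_id to a list of unique fragment
    counts, one per cell. The shallowest well keeps everything (1.0).
    """
    medians = {lib: median(counts) for lib, counts in fragments_per_cell.items()}
    target = min(medians.values())
    if target <= 0:
        raise ValueError("every GEM well needs a positive median depth")
    return {lib: target / m for lib, m in medians.items()}


def subsample(fragments, keep_fraction, seed=0):
    """Keep each fragment independently with probability keep_fraction."""
    rng = random.Random(seed)
    return [f for f in fragments if rng.random() < keep_fraction]


if __name__ == "__main__":
    depths = {"LV123": [1000, 2000, 3000], "LB456": [4000, 4000, 8000]}
    print(depth_keep_fractions(depths))
```

In this toy input the medians are 2000 and 4000, so LB456 would be subsampled to half of its fragments while LV123 is left untouched.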