Cell Ranger ATAC 2.0, printed on 12/18/2024
When conducting large studies involving multiple GEM wells, run cellranger-atac count on FASTQ data from each of the GEM wells individually, then pool the results using cellranger-atac aggr, as described here.
cellranger-atac aggr is not designed for combining multiple sequencing runs of the same GEM well. Instead, pass a list of FASTQ files from resequenced libraries to the --fastqs argument of cellranger-atac count.
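
As a rough sketch of the resequencing case, the snippet below joins two FASTQ folders into a single --fastqs value. The folder paths are hypothetical, and the comma-separated form is an assumption that should be confirmed against cellranger-atac count --help. The command is printed rather than run, and the remaining count arguments are left elided.

```shell
# Hypothetical FASTQ folders, one per sequencing run of the same GEM well.
fastq_dirs="/path/to/run1_fastqs /path/to/run2_fastqs"

# Join the folders with commas (assumed list format for --fastqs).
# Note: this simple join does not handle paths that contain spaces.
fastqs_arg=$(printf '%s' "$fastq_dirs" | tr ' ' ',')

# Print the command instead of running it; other count arguments stay elided.
echo "cellranger-atac count --id=LV123 --fastqs=$fastqs_arg ..."
```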
The cellranger-atac aggr command takes as input a CSV file specifying a list of cellranger-atac count output files (specifically the fragments.tsv.gz and singlecell.csv from each run), and produces a single peak-barcode matrix containing all the data.
When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence (see GEM wells).
By default, the reads from each GEM well are subsampled such that all GEM wells have the same effective sequencing depth, measured in terms of the median number of unique fragments per cell. However, it is possible to turn off this normalization altogether (see Depth Normalization).
The first step is to run a single instance of cellranger-atac count on each individual GEM well prepared using the Chromium platform, as described in Single-GEM Well Analysis.
For example, suppose you ran three count pipelines as follows:

    $ cd /opt/runs
    $ cellranger-atac count --id=LV123 ...
    ... wait for pipeline to finish ...
    $ cellranger-atac count --id=LB456 ...
    ... wait for pipeline to finish ...
    $ cellranger-atac count --id=LP789 ...
    ... wait for pipeline to finish ...

You can aggregate these three runs to get an aggregated matrix and analysis. In order to do so, you need to create an Aggregation CSV.
Create a CSV file with a header line containing the following columns:

- library_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it does not need to match any previous ID assigned to the GEM well.
- fragments: Path to the fragments.tsv.gz file produced by cellranger-atac count. For example, if you processed your GEM well by calling cellranger-atac count --id=ID in some directory /DIR, the fragments path would be /DIR/ID/outs/fragments.tsv.gz.
- cells: Path to the singlecell.csv file produced by cellranger-atac count.

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, your Excel spreadsheet should look like this:
Row | A | B | C |
---|---|---|---|
1 | library_id | fragments | cells |
2 | LV123 | /opt/runs/LV123/outs/fragments.tsv.gz | /opt/runs/LV123/outs/singlecell.csv |
3 | LB456 | /opt/runs/LB456/outs/fragments.tsv.gz | /opt/runs/LB456/outs/singlecell.csv |
4 | LP789 | /opt/runs/LP789/outs/fragments.tsv.gz | /opt/runs/LP789/outs/singlecell.csv |
When you save it as a CSV, the result looks like this:

    library_id,fragments,cells
    LV123,/opt/runs/LV123/outs/fragments.tsv.gz,/opt/runs/LV123/outs/singlecell.csv
    LB456,/opt/runs/LB456/outs/fragments.tsv.gz,/opt/runs/LB456/outs/singlecell.csv
    LP789,/opt/runs/LP789/outs/fragments.tsv.gz,/opt/runs/LP789/outs/singlecell.csv

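
When all runs sit under one directory, the same CSV can also be generated from the run IDs instead of being typed by hand. The helper below is a minimal sketch: make_aggr_csv is a hypothetical name, and it assumes each library_id is simply the run ID passed to --id, which the format does not require.

```shell
# Hypothetical helper: print an aggregation CSV for runs that share one parent
# directory. Usage: make_aggr_csv RUNS_DIR ID [ID ...]
make_aggr_csv() {
    local runs_dir="$1" id
    shift
    printf 'library_id,fragments,cells\n'
    for id in "$@"; do
        printf '%s,%s/%s/outs/fragments.tsv.gz,%s/%s/outs/singlecell.csv\n' \
            "$id" "$runs_dir" "$id" "$runs_dir" "$id"
    done
}

# Reproduces the CSV shown above; redirect to a file to pass it to --csv.
make_aggr_csv /opt/runs LV123 LB456 LP789
```

The row order matters, because GEM well suffixes are assigned in the order the libraries appear in this file.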
These are the required command line arguments (also available through cellranger-atac aggr --help):
Argument | Description |
---|---|
--id=ID | A unique run id and output folder name [a-zA-Z0-9_-]+ of maximum length 64 characters. |
--csv=CSV | Path to CSV file enumerating cellranger-atac count outputs (see Setting up a CSV). |
--reference=PATH | Path to folder containing a Cell Ranger ATAC or Cell Ranger ARC reference. |
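
The --id constraint in the table can be checked before launching a run. The function below is a sketch of such a check; is_valid_run_id is a hypothetical name and not part of cellranger-atac.

```shell
# Hypothetical pre-flight check: succeed only if the ID matches
# [a-zA-Z0-9_-]+ and is at most 64 characters long.
is_valid_run_id() {
    local id="$1"
    [ "${#id}" -le 64 ] && printf '%s\n' "$id" | LC_ALL=C grep -Eq '^[a-zA-Z0-9_-]+$'
}

for candidate in AGG123 "AGG 123" "agg_2024-12"; do
    if is_valid_run_id "$candidate"; then
        echo "ok:      $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```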
Additional optional parameters are available:
Option | Description |
---|---|
--description=TEXT | Sample description to embed in output files [default: ] |
--peaks=BED | Override peak caller: specify peaks to use in downstream analyses from supplied 3-column BED file. The supplied peaks file must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed |
--normalize=MODE | Library depth normalization mode [default: depth] [possible values: depth, none] |
--dim-reduce=STR | Dimensionality reduction mode for clustering. Note: plsa has been temporarily restricted to run in single-threaded mode due to technical considerations. This could lead to a longer wall time for execution as compared to v1.2. Multi-threading will be restored in a subsequent release [default: lsa] [possible values: lsa, pca, plsa] |
--jobmode=MODE | Job manager to use. Valid options: local (default), sge, lsf, slurm or path to a .template file. Search for help on "Cluster Mode" at support.10xgenomics.com for more details on configuring the pipeline to use a compute cluster [default: local] |
--localcores=NUM | Set max cores the pipeline may request at one time. Only applies to local jobs |
--localmem=NUM | Set max memory (GB) the pipeline may request at one time. Only applies to local jobs |
--localvmem=NUM | Set max virtual address space in GB for the pipeline. Only applies to local jobs |
--mempercore=NUM | Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much memory available. Only applies to cluster jobmodes |
--maxjobs | Set max jobs submitted to cluster at one time. Only applies to cluster jobmodes |
--jobinterval | Set delay between submitting jobs to cluster, in ms. Only applies to cluster jobmodes |
--overrides=PATH | The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than --localcores, --mempercore and --localmem. Consult https://support.10xgenomics.com/ for an example override file |
--uiport=PORT | Serve web UI at http://localhost:PORT |
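
Because --peaks requires a sorted, non-overlapping BED file, it can help to check the file before starting a long run. The awk-based function below is a sketch, not the pipeline's own validation. check_peaks_bed is a hypothetical name, intervals are assumed to be half-open (so a peak may start exactly where the previous one ends), and it only compares consecutive records on the same chromosome.

```shell
# Hypothetical pre-flight check for a --peaks file. Exits non-zero if any
# record has fewer than 3 columns, is out of start order within a chromosome,
# or overlaps the previous peak. Comment lines beginning with # are skipped.
check_peaks_bed() {
    awk -F'\t' '
        /^#/ { next }
        NF < 3 { printf "line %d: fewer than 3 columns\n", NR; bad = 1; next }
        $1 == chrom {
            if ($2 + 0 < prev_start)    { printf "line %d: not sorted by start\n", NR; bad = 1 }
            else if ($2 + 0 < prev_end) { printf "line %d: overlaps previous peak\n", NR; bad = 1 }
        }
        { chrom = $1; prev_start = $2 + 0; prev_end = $3 + 0 }
        END { exit bad + 0 }
    ' "$1"
}

# Example call (my_peaks.bed is a placeholder file name):
# check_peaks_bed my_peaks.bed && echo "peaks file looks usable"
```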
After specifying input arguments and options, run cellranger-atac aggr:

    $ cd /home/jdoe/runs
    $ cellranger-atac aggr --id=AGG123 \
                           --csv=AGG123_libraries.csv \
                           --normalize=depth \
                           --reference=/home/jdoe/refs/hg19

The pipeline will begin to run, creating a new folder named with the aggregation ID specified with the --id argument (e.g. /home/jdoe/runs/AGG123). If this output folder already exists, cellranger-atac will assume it is an existing pipestance and attempt to resume running it.
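
Since an existing folder changes the behavior from "start" to "resume", a small guard before launching can make that explicit. This is a sketch only; pipestance_status is a hypothetical helper that looks at the current directory and does not inspect the folder's contents.

```shell
# Hypothetical guard: report whether an --id would start a new pipestance
# or point at a folder that already exists.
pipestance_status() {
    if [ -e "$1" ]; then
        echo "exists"   # cellranger-atac would treat this as a pipestance to resume
    else
        echo "new"      # a fresh output folder would be created
    fi
}

run_id=AGG123
if [ "$(pipestance_status "$run_id")" = "exists" ]; then
    echo "Folder '$run_id' already exists; aggr would try to resume it." >&2
fi
```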
When combining data from multiple GEM wells, the cellranger-atac aggr pipeline automatically equalizes the average read depth per cell between groups before merging. When libraries are sequenced to very different read depths per cell, you may observe that cells cluster by library of origin rather than by cell type. This is commonly referred to as a batch effect in the literature. Many factors can cause batch effects in single-cell data, and sequencing depth is only one of them. The downsampling normalization in cellranger-atac aggr specifically addresses sequencing depth batch effects but not others. It is possible to turn off normalization or change how it is done with the --normalize option. The none option may be appropriate if you want to maximize sensitivity and plan to deal with depth normalization or more general batch correction in a downstream step.
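
To make the idea of depth equalization concrete, the sketch below computes, for each library, the fraction of fragments that would be kept if every library were downsampled to the lowest median number of unique fragments per cell. This is an illustration of the arithmetic under that assumption, not the pipeline's actual procedure. The function name and the median values are hypothetical, and medians are assumed to be positive.

```shell
# Input on stdin: one "library_id median_unique_fragments_per_cell" pair per
# line. Output: the fraction of fragments each library would keep so that all
# libraries match the lowest median (an illustrative assumption).
subsample_fractions() {
    awk '
        { id[NR] = $1; med[NR] = $2 + 0
          if (NR == 1 || med[NR] < min) min = med[NR] }
        END { for (i = 1; i <= NR; i++) printf "%s %.2f\n", id[i], min / med[i] }
    '
}

# Hypothetical medians for the three example libraries.
printf 'LV123 8000\nLB456 10000\nLP789 16000\n' | subsample_fractions
# prints:
# LV123 1.00
# LB456 0.80
# LP789 0.50
```

In this toy example the shallowest library is left untouched and the deepest one loses half of its fragments, which is the sensitivity cost that --normalize=none avoids.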
There are two normalization modes:

- none: Do not normalize at all.
- depth (default): Subsample reads from higher-depth GEM wells until all GEM wells have approximately the same median number of unique fragments per cell.

The cellranger-atac aggr pipeline generates output files that contain all of the data from the individual input jobs, aggregated into single output files, for convenient multi-sample analysis. The GEM well suffix of each barcode is updated to prevent barcode collisions, as described below.
Each output file produced by cellranger-atac aggr follows the format described in the Understanding Output section, but includes the union of all the relevant barcodes from each input job.
cellranger-atac aggr does not perform a cell-calling step; it simply aggregates the cell calls encoded in singlecell.csv from each input job into a final set of cell calls.
A successful run will conclude with a message like this:

    2021-04-24 14:06:25 [runtime] (update) ID.AGG123.SC_ATAC_AGGREGATOR_CS.SC_ATAC_AGGREGATOR.ATAC_CLOUPE_PREPROCESS.fork0 join_running
    2021-04-24 14:07:13 [runtime] (join_complete) ID.AGG123.SC_ATAC_AGGREGATOR_CS.SC_ATAC_AGGREGATOR.ATAC_CLOUPE_PREPROCESS

    Outputs:
    - Barcoded and aligned fragment file: /home/jdoe/runs/AGG123/outs/fragments.tsv.gz
    - Fragment file index: /home/jdoe/runs/AGG123/outs/fragments.tsv.gz.tbi
    - Per-barcode fragment counts & metrics: /home/jdoe/runs/AGG123/outs/singlecell.csv
    - Bed file of all called peak locations: /home/jdoe/runs/AGG123/outs/peaks.bed
    - Filtered peak barcode matrix in hdf5 format: /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix.h5
    - Filtered peak barcode matrix in mex format: /home/jdoe/runs/AGG123/outs/filtered_peak_bc_matrix
    - Directory of analysis files: /home/jdoe/runs/AGG123/outs/analysis
    - HTML file summarizing aggregation analysis : /home/jdoe/runs/AGG123/outs/web_summary.html
    - Filtered tf barcode matrix in hdf5 format: /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix.h5
    - Filtered tf barcode matrix in mex format: /home/jdoe/runs/AGG123/outs/filtered_tf_bc_matrix
    - Loupe Browser input file: /home/jdoe/runs/AGG123/outs/cloupe.cloupe
    - csv summarizing important metrics and values: /home/jdoe/runs/AGG123/outs/summary.csv
    - Summary of all data metrics: /home/jdoe/runs/AGG123/outs/summary.json
    - Annotation of peaks with genes: /home/jdoe/runs/AGG123/outs/peak_annotation.tsv
    - Csv of aggregation of libraries: /home/jdoe/runs/AGG123/outs/aggregation_csv.csv

    Pipestance completed successfully!

Once cellranger-atac aggr has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the cellranger-atac aggr section of the Summary Metrics page.
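
Before handing the results to downstream tools, it can be useful to confirm that the files named in the completion message are actually present. The function below is a sketch of such a check. check_aggr_outputs is a hypothetical name, and only regular files from the message are listed; the three output directories could be added in the same way.

```shell
# Hypothetical post-run check: confirm that the output files listed in the
# completion message exist under an aggr outs/ folder.
check_aggr_outputs() {
    local outs="$1" missing=0 f
    for f in fragments.tsv.gz fragments.tsv.gz.tbi singlecell.csv peaks.bed \
             filtered_peak_bc_matrix.h5 filtered_tf_bc_matrix.h5 \
             web_summary.html cloupe.cloupe summary.csv summary.json \
             peak_annotation.tsv aggregation_csv.csv; do
        if [ ! -e "$outs/$f" ]; then
            echo "missing: $outs/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the paths from the run above:
# check_aggr_outputs /home/jdoe/runs/AGG123/outs && echo "all outputs present"
```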
Each GEM well is a physically distinct set of GEM partitions, but draws barcode
sequences randomly from the pool of valid barcodes, known as the barcode
whitelist. To keep the barcodes unique when aggregating multiple libraries, we
append a small integer identifying the GEM well to the barcode nucleotide
sequence, and use that nucleotide sequence plus ID as the unique identifier in
the feature-barcode matrix. For example, AGACCATTGAGACTTA-1
and
AGACCATTGAGACTTA-2
are distinct cell barcodes from different GEM
wells, despite having the same barcode nucleotide sequence.
This number, which indicates which GEM well the barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells reflects the order in which the GEM wells were listed in the Aggregation CSV.
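
Because the suffix follows the row order of the Aggregation CSV, a barcode can be traced back to its library_id with a short lookup. The helper below is a sketch under the assumption that numbering starts at 1 for the first data row, as the -1 and -2 examples suggest. library_for_barcode is a hypothetical name, and the CSV is assumed to contain no quoted commas.

```shell
# Hypothetical helper: map an aggregated barcode to its library_id by reading
# the GEM well suffix and picking the matching data row of the Aggregation CSV
# (suffix 1 = first row after the header).
library_for_barcode() {
    local csv="$1" barcode="$2" suffix
    suffix="${barcode##*-}"
    awk -F, -v n="$suffix" 'NR == n + 1 { print $1 }' "$csv"
}

# Example, with the Aggregation CSV from earlier saved as AGG123_libraries.csv:
# library_for_barcode AGG123_libraries.csv AGACCATTGAGACTTA-2   # would print LB456
```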