Aggregating Multiple Libraries with cellranger aggr

In large studies involving multiple biological samples (or multiple libraries or replicates of the same sample), it is best to run cellranger count on each library individually and then pool the results using cellranger aggr.

The cellranger aggr command takes a CSV file specifying a list of cellranger count output files (specifically the molecule_info.h5 from each run), and produces a single gene-barcode matrix containing all the data.

When combining multiple libraries, the barcode sequences for each library are distinguished by their GEM group (see Gem Groups).

By default, each library's reads are subsampled such that all libraries have the same effective sequencing depth, measured in terms of reads per cell (see Depth Normalization).

Requirements

The first step is to run cellranger count on each individual library prepared using the 10x system, as described in Single-Library Analysis.

For example, suppose you ran three count pipelines as follows:

$ cd /opt/runs
$ cellranger count --id=LV123 ...
... wait for pipeline to finish ...
$ cellranger count --id=LB456 ...
... wait for pipeline to finish ...
$ cellranger count --id=LP789 ...
... wait for pipeline to finish ...

Now you want to aggregate these three runs to get a single gene-barcode matrix and analysis. In order to do so, you need to create an Aggregation CSV.

Setting Up An Aggregation CSV

Create a CSV file with a header line containing the following columns:

• library_id: Unique identifier for this input library. This will be used for labeling purposes only; it doesn't need to match any previous ID you've assigned to the library.
• molecule_h5: Path to the molecule_info.h5 file produced by cellranger count. So if you processed your library by calling cellranger count --id=ID in some directory /DIR, this path would be /DIR/ID/outs/molecule_info.h5.

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, your Excel spreadsheet would look like this:

     A            B
1    library_id   molecule_h5
2    LV123        /opt/runs/LV123/outs/molecule_info.h5
3    LB456        /opt/runs/LB456/outs/molecule_info.h5
4    LP789        /opt/runs/LP789/outs/molecule_info.h5

When you save it as a CSV, the result would look like this:

library_id,molecule_h5
LV123,/opt/runs/LV123/outs/molecule_info.h5
LB456,/opt/runs/LB456/outs/molecule_info.h5
LP789,/opt/runs/LP789/outs/molecule_info.h5
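If you have many runs, the same file can be generated with a script instead of a spreadsheet. The following is a minimal sketch in Python using only the standard library. The helper names aggregation_rows and write_aggregation_csv are hypothetical, and the sketch assumes that every run lives under one directory and that each run ID is reused as its library_id, as in the example above.

```python
import csv
import io
from pathlib import PurePosixPath


def aggregation_rows(run_dir, run_ids):
    """Return (library_id, molecule_h5) pairs for runs located under run_dir.

    Assumes each run was produced by `cellranger count --id=ID` inside run_dir,
    so its molecule file is at run_dir/ID/outs/molecule_info.h5.
    """
    base = PurePosixPath(run_dir)
    return [
        (run_id, str(base / run_id / "outs" / "molecule_info.h5"))
        for run_id in run_ids
    ]


def write_aggregation_csv(handle, rows):
    """Write the header line and one line per library to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["library_id", "molecule_h5"])
    writer.writerows(rows)


rows = aggregation_rows("/opt/runs", ["LV123", "LB456", "LP789"])

# Written to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_aggregation_csv(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To save the result, open a file such as AGG123_libraries.csv in write mode and pass that handle to write_aggregation_csv in place of the buffer.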


Command Line Interface

These are the most common command line arguments (run cellranger aggr --help for a full list):

Argument            Description
--id=ID             A unique run ID string, e.g. AGG123.
--csv=CSV           Path to a CSV file containing a list of cellranger count outputs (see Setting Up An Aggregation CSV).
--normalize=MODE    (Optional) String specifying how to normalize depth across the input libraries. Valid values: mapped (default), raw, or none (see Depth Normalization).
--nosecondary       (Optional) Add this flag to skip secondary analysis of the gene-barcode matrix (dimensionality reduction, clustering and visualization). Set this if you plan to use cellranger reanalyze or your own custom analysis.

After specifying these input arguments, run cellranger aggr:

$ cd /home/jdoe/runs
$ cellranger aggr --id=AGG123 \
                  --csv=AGG123_libraries.csv \
                  --normalize=mapped


The pipeline will begin to run, creating a new folder named with the aggregation ID you specified (e.g. /home/jdoe/runs/AGG123) for its output. If this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it.
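If you launch aggregations from a script, it can help to assemble the command from the arguments listed above and check the normalization mode before anything runs. The following is a minimal sketch, not part of cellranger itself: build_aggr_command is a hypothetical helper, and it uses only the four arguments documented in this section. It builds and prints the command without executing it.

```python
import shlex

# Normalization modes listed in the argument table above.
VALID_MODES = ("mapped", "raw", "none")


def build_aggr_command(run_id, csv_path, normalize="mapped", nosecondary=False):
    """Return the cellranger aggr invocation as a list of arguments."""
    if normalize not in VALID_MODES:
        raise ValueError(f"normalize must be one of {VALID_MODES}, got {normalize!r}")
    command = [
        "cellranger",
        "aggr",
        f"--id={run_id}",
        f"--csv={csv_path}",
        f"--normalize={normalize}",
    ]
    if nosecondary:
        # Skip dimensionality reduction, clustering and visualization.
        command.append("--nosecondary")
    return command


command = build_aggr_command("AGG123", "AGG123_libraries.csv")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run with cwd set to the directory where the output folder should be created, for example /home/jdoe/runs.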

Pipeline Outputs

A successful run should conclude with a message similar to this:

2016-11-04 13:36:33 [runtime] (run:local)       ID.AGG123.SC_RNA_AGGREGATOR_CS.SC_RNA_AGGREGATOR.SUMMARIZE_AGGREGATED_REPORTS.fork0.join
2016-11-04 13:36:36 [runtime] (join_complete)   ID.AGG123.SC_RNA_AGGREGATOR_CS.SC_RNA_AGGREGATOR.SUMMARIZE_AGGREGATED_REPORTS
2016-11-04 13:36:45 [runtime] VDR killed 210 files, 29MB.

Outputs:
- Aggregation metrics summary HTML:       /home/jdoe/runs/AGG123/outs/web_summary.html
- Aggregation metrics summary JSON:       /home/jdoe/runs/AGG123/outs/summary.json
- Secondary analysis output CSV:          /home/jdoe/runs/AGG123/outs/analysis_csv
- Filtered gene-barcode matrices HDF5:    /home/jdoe/runs/AGG123/outs/filtered_gene_bc_matrices_h5.h5
- Filtered gene-barcode matrices MEX:     /home/jdoe/runs/AGG123/outs/filtered_gene_bc_matrices_mex
- Filtered molecule-level info:           /home/jdoe/runs/AGG123/outs/filtered_molecules.h5
- Unfiltered gene-barcode matrices HDF5:  /home/jdoe/runs/AGG123/outs/raw_gene_bc_matrices_h5.h5
- Unfiltered gene-barcode matrices MEX:   /home/jdoe/runs/AGG123/outs/raw_gene_bc_matrices_mex
- Unfiltered molecule-level info:         /home/jdoe/runs/AGG123/outs/raw_molecules.h5
- Barcodes of cell-containing partitions: /home/jdoe/runs/AGG123/outs/cell_barcodes.csv
- Copy of the input CSV:                  /home/jdoe/runs/AGG123/outs/aggregation_csv.csv
- Loupe Cell Browser file:                /home/jdoe/runs/AGG123/outs/cloupe.cloupe

Pipestance completed successfully!


Once cellranger has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Cell Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the cellranger aggr section of the Summary Metrics page.

Gem Groups

A GEM group is an integer that is appended to each barcode in the gene-barcode matrix to identify the library the barcode came from. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different libraries, despite having the same nucleotide sequence.

The GEM groups are numbered in the order in which the libraries are listed in the Aggregation CSV.
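When exploring the aggregated matrix by hand, you may want to map each barcode back to its source library. The following is a minimal sketch in Python; the helper names split_barcode and library_for_barcode are hypothetical. It assumes that GEM group numbering starts at 1 for the first row of the Aggregation CSV, which is consistent with the -1 and -2 suffixes in the example above but should be checked against your own output.

```python
def split_barcode(barcode):
    """Split 'SEQUENCE-N' into (SEQUENCE, N), where N is the integer GEM group."""
    sequence, separator, group = barcode.rpartition("-")
    if not separator or not group.isdigit():
        raise ValueError(f"barcode has no GEM group suffix: {barcode!r}")
    return sequence, int(group)


def library_for_barcode(barcode, library_ids):
    """Look up the library_id for a barcode.

    library_ids must be in the same order as the rows of the Aggregation CSV.
    Assumption: GEM group 1 corresponds to the first row.
    """
    _, group = split_barcode(barcode)
    if not 1 <= group <= len(library_ids):
        raise ValueError(f"GEM group {group} has no matching library")
    return library_ids[group - 1]


libraries = ["LV123", "LB456", "LP789"]
print(library_for_barcode("AGACCATTGAGACTTA-1", libraries))
print(library_for_barcode("AGACCATTGAGACTTA-2", libraries))
```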

Depth Normalization

When combining data from multiple libraries, we recommend equalizing the read depth between libraries before merging, to reduce the batch effect introduced by sequencing. The cellranger aggr pipeline does this automatically by default, but you can choose to turn it off or change the way the normalization is done.

There are three normalization modes:

• mapped: (default) Subsample reads from higher-depth libraries until they all have an equal number of confidently mapped reads per cell.
• raw: Subsample reads from higher-depth libraries until they all have an equal number of total (i.e. raw, mapping-independent) reads per cell.
• none: Do not normalize at all.
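The arithmetic behind subsampling to an equal depth can be illustrated with a short calculation. The following is a sketch of the idea only, not the pipeline's actual implementation: it assumes the target depth is that of the lowest-depth library and that each library keeps the fraction of its reads needed to reach that target. The function name subsample_fractions and the reads-per-cell figures are hypothetical.

```python
def subsample_fractions(reads_per_cell):
    """Return the fraction of reads each library would keep.

    reads_per_cell maps library_id to its depth in reads per cell (confidently
    mapped reads for a 'mapped'-style comparison, total reads for 'raw').
    Assumption: every library is brought down to the lowest depth present.
    """
    if not reads_per_cell:
        raise ValueError("at least one library is required")
    if any(depth <= 0 for depth in reads_per_cell.values()):
        raise ValueError("depths must be positive")
    target = min(reads_per_cell.values())
    return {library: target / depth for library, depth in reads_per_cell.items()}


# Hypothetical depths, for illustration only.
depths = {"LV123": 50000, "LB456": 40000, "LP789": 80000}
fractions = subsample_fractions(depths)
for library, fraction in fractions.items():
    print(f"{library}: keep {fraction:.0%} of reads")
```

In this illustration the lowest-depth library (LB456) keeps all of its reads, while the deeper libraries are reduced to match it.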