
# What is aggr?

Many experiments involve generating data for multiple samples that are processed through the same Gel Bead-in Emulsion (GEM) well on the Chromium instrument or through different GEM wells. Depending on the experimental design, these could be replicates from the same set of cells, cells from different tissues or time points from the same individual, or cells from different individuals. The cellranger count pipeline processes data from a single sample in a single GEM well, while the cellranger multi pipeline processes data from multiple samples in a single GEM well. The aggr pipeline aggregates the outputs for multiple samples generated via multiple runs of cellranger count and performs analysis on the combined data. It also aggregates the outputs for multiple samples generated via one or more runs of cellranger multi.

# Aggregating outputs from cellranger count

When doing large studies involving samples run on multiple GEM wells, run cellranger count on FASTQ data from each of the GEM wells individually, and then pool the results using cellranger aggr, as described here.

The cellranger aggr command takes a CSV file specifying a list of cellranger count output files (specifically the molecule_info.h5 from each run), and produces a single feature-barcode matrix containing all the data.

When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence (see GEM wells).

By default, the reads from each GEM well are subsampled such that all GEM wells have the same effective sequencing depth, measured in terms of reads that are confidently mapped to the transcriptome or assigned to the feature IDs per cell. However, it is possible to change the depth normalization mode (see Depth Normalization).

## Requirements

The first step is to run cellranger count on each individual GEM well prepared using the 10x Chromium™ platform, as described in Single-GEM Well Analysis.

If Feature Barcode analysis is included, then the feature reference CSV file provided to cellranger count should be the same for each GEM well. Targeted Gene Expression data is supported by cellranger aggr and can be aggregated with whole transcriptome Gene Expression data, provided that all GEM wells have matching chemistries and that the same target panel CSV file is used for all targeted samples.

For example, suppose you ran three count pipelines as follows:

$ cd /opt/runs
$ cellranger count --id=LV123 ...
... wait for pipeline to finish ...
$ cellranger count --id=LB456 ...
... wait for pipeline to finish ...
$ cellranger count --id=LP789 ...
... wait for pipeline to finish ...

Now you can aggregate these three runs to get a single feature-barcode matrix and analysis. In order to do so, you need to create an Aggregation CSV.

## Setting Up an Aggregation CSV

Create a CSV file with a header line containing the following columns:

• sample_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it doesn't need to match any previous ID you've assigned to the GEM well.
• molecule_h5: Path to the molecule_info.h5 file produced by cellranger count. For example, if you processed your GEM well by calling cellranger count --id=ID in some directory /DIR, this path would be /DIR/ID/outs/molecule_info.h5.

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, your Excel spreadsheet would look like this:

|   | A | B |
|---|---|---|
| 1 | sample_id | molecule_h5 |
| 2 | LV123 | /opt/runs/LV123/outs/molecule_info.h5 |
| 3 | LB456 | /opt/runs/LB456/outs/molecule_info.h5 |
| 4 | LP789 | /opt/runs/LP789/outs/molecule_info.h5 |

When you save it as a CSV, the result would look like this:

sample_id,molecule_h5
LV123,/opt/runs/LV123/outs/molecule_info.h5
LB456,/opt/runs/LB456/outs/molecule_info.h5
LP789,/opt/runs/LP789/outs/molecule_info.h5


In addition to the CSV columns expected by cellranger aggr, you may optionally supply additional columns containing library meta-data (e.g., lab or sample origin). These custom library annotations do not affect the analysis pipeline but can be visualized downstream in the Loupe Browser (see below). Note that unlike other CSV inputs to Cell Ranger, these custom columns may contain characters outside the ASCII range (e.g., non-Latin characters).
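If you would rather generate the Aggregation CSV from a script than from a spreadsheet, the following is a minimal sketch using only the Python standard library. The helper names `aggr_rows` and `to_csv_text` are hypothetical; only the `sample_id` and `molecule_h5` column names and the `/DIR/ID/outs/molecule_info.h5` layout come from the text above.

```python
import csv
import io
from pathlib import PurePosixPath


def aggr_rows(run_dir, run_ids, extra=None):
    """Build one CSV row per cellranger count run.

    run_dir: directory in which `cellranger count --id=ID` was called.
    extra:   optional {column_name: {run_id: value}} of custom library
             annotations (illustrative only; not required by cellranger aggr).
    """
    extra = extra or {}
    rows = []
    for run_id in run_ids:
        h5 = PurePosixPath(run_dir) / run_id / "outs" / "molecule_info.h5"
        row = {"sample_id": run_id, "molecule_h5": str(h5)}
        for column, values in extra.items():
            row[column] = values[run_id]
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = aggr_rows("/opt/runs", ["LV123", "LB456", "LP789"])
csv_text = to_csv_text(rows)
print(csv_text)
```

Writing `csv_text` to a file gives the same content as the exported spreadsheet. Passing something like `extra={"origin": {...}}` would add a custom annotation column; the column name `origin` is only an example.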

# Aggregating outputs from cellranger multi

The cellranger aggr command can take a CSV file specifying a list of per-sample molecule_info.h5 files from cellranger multi, and aggregate any combination of Gene Expression and Feature Barcode (cell surface protein or CRISPR) data present in the individual sample outputs.

Consider two per-sample datasets containing data from one 3' CellPlexing experiment:

$ cd /opt/runs
$ cellranger multi --id=Run1 ...
... wait for pipeline to finish ...


To aggregate the datasets, you need to create a CSV containing the following columns:

| Column Name | Description |
|---|---|
| sample_id | Unique identifier for this sample. This will be used for labeling purposes only. |
| molecule_h5 | Path to the per-sample molecule info file generated by the cellranger multi pipeline. For example, if you processed your CellPlex data by calling cellranger multi --id=ID in some directory /DIR, and the sample was called Sample1, this path would be /DIR/ID/outs/per_sample_outs/Sample1/count/sample_molecule_info.h5 |

Apart from the change in the path to the per-sample molecule info file, the sections on additional columns for creating categories, depth normalization, batch correction, etc. apply here as well.

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Your Excel spreadsheet might look like this:

|   | A | B |
|---|---|---|
| 1 | sample_id | molecule_h5 |
| 2 | Sample1 | /opt/runs/Run1/outs/per_sample_outs/Sample1/count/sample_molecule_info.h5 |
| 3 | Sample2 | /opt/runs/Run1/outs/per_sample_outs/Sample2/count/sample_molecule_info.h5 |

When you save it as a CSV, the result would look like this:

sample_id,molecule_h5
Sample1,/opt/runs/Run1/outs/per_sample_outs/Sample1/count/sample_molecule_info.h5
Sample2,/opt/runs/Run1/outs/per_sample_outs/Sample2/count/sample_molecule_info.h5
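Because the per-sample paths are long and follow a fixed pattern, it can be convenient to build them in a script. The sketch below is one way to do that; the helper name `per_sample_molecule_h5` is hypothetical, and the file name is taken from the example CSV above.

```python
from pathlib import PurePosixPath


def per_sample_molecule_h5(run_dir, run_id, sample_name):
    """Path to one sample's molecule info file inside a cellranger multi run.

    Follows the layout of the example CSV above:
    RUN_DIR/ID/outs/per_sample_outs/SAMPLE/count/sample_molecule_info.h5
    """
    return str(
        PurePosixPath(run_dir) / run_id / "outs" / "per_sample_outs"
        / sample_name / "count" / "sample_molecule_info.h5"
    )


samples = ["Sample1", "Sample2"]
lines = ["sample_id,molecule_h5"]
lines += [f"{s},{per_sample_molecule_h5('/opt/runs', 'Run1', s)}" for s in samples]
aggr_csv = "\n".join(lines) + "\n"
print(aggr_csv)
```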


You can run the aggr pipeline as follows:

$ cd /opt/runs
$ cellranger aggr --id=MySamples --csv=aggr.csv


# Creating Categories

When combining multiple samples into a single dataset with the cellranger aggr pipeline, you can assign categories and values to individual samples by adding columns to the cellranger aggr input spreadsheet. These category assignments propagate into Loupe Browser, where you can view them, and determine genes that drive differences within samples. For example, the following spreadsheet was used to aggregate the tutorial dataset:

|   | A | B | C |
|---|---|---|---|
| 1 | sample_id | molecule_h5 | AMLStatus |
| 2 | AMLNormal1 | /path/to/AMLNormal1/molecule_info.h5 | Normal |
| 3 | AMLNormal2 | /path/to/AMLNormal2/molecule_info.h5 | Normal |
| 4 | AMLPatient | /path/to/AMLPatient/molecule_info.h5 | Patient |

Any columns in addition to 'sample_id' and 'molecule_h5' will be converted into categories, and the cells in each sample will be assigned to one of the values in that category.
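To preview which categories a given Aggregation CSV would define, you can parse it yourself before running the pipeline. This is a rough sketch, not part of Cell Ranger: the function name `categories_by_sample` is hypothetical, and it simply applies the rule stated above (every column other than sample_id and molecule_h5 is a category) to the tutorial spreadsheet.

```python
import csv
import io

REQUIRED = ("sample_id", "molecule_h5")

AGGR_CSV = """\
sample_id,molecule_h5,AMLStatus
AMLNormal1,/path/to/AMLNormal1/molecule_info.h5,Normal
AMLNormal2,/path/to/AMLNormal2/molecule_info.h5,Normal
AMLPatient,/path/to/AMLPatient/molecule_info.h5,Patient
"""


def categories_by_sample(csv_text):
    """Return {category: {sample_id: value}} for every non-required column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    categories = {c: {} for c in header if c not in REQUIRED}
    for row in reader:
        for name in categories:
            categories[name][row["sample_id"]] = row[name]
    return categories


cats = categories_by_sample(AGGR_CSV)
print(cats)
```

Here every cell from AMLNormal1 and AMLNormal2 would carry the value Normal in the AMLStatus category, and every cell from AMLPatient the value Patient.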

# Aggregating Libraries With Different Chemistry Versions

If you are aggregating libraries generated by different chemistry versions of the Single Cell Gene Expression Reagents, you might observe systematic differences in gene expression profiles between libraries. The cellranger aggr pipeline incorporates batch effect correction (algorithm details) to overcome this. To enable this module, you should include the following column in your aggregation CSV file:

• batch: (optional) Unique identifier for the batch that this GEM well belongs to. Libraries with the same batch identifier will be considered to be in the same batch.

For example, if the LV123 sample in the previous example is a v2 library, and the LB456 and LP789 samples are v3 libraries, you would set up the aggregation CSV file like this:

sample_id,molecule_h5,batch
LV123,/opt/runs/LV123/outs/molecule_info.h5,v2_lib
LB456,/opt/runs/LB456/outs/molecule_info.h5,v3_lib
LP789,/opt/runs/LP789/outs/molecule_info.h5,v3_lib


The v2_lib and v3_lib identifiers are just example identifiers. Every sample from a given batch has to have the same batch identifier, but otherwise the identifier itself is arbitrary.
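A quick way to confirm that the batch column groups samples as intended is to read the CSV back and list the samples per batch. The following is a minimal sketch; `samples_per_batch` is a hypothetical helper, and rejecting an empty batch value is a sanity check of this sketch rather than documented cellranger aggr behavior.

```python
import csv
import io
from collections import defaultdict

AGGR_CSV = """\
sample_id,molecule_h5,batch
LV123,/opt/runs/LV123/outs/molecule_info.h5,v2_lib
LB456,/opt/runs/LB456/outs/molecule_info.h5,v3_lib
LP789,/opt/runs/LP789/outs/molecule_info.h5,v3_lib
"""


def samples_per_batch(csv_text):
    """Group sample IDs by their batch identifier, preserving CSV order."""
    batches = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        batch = (row.get("batch") or "").strip()
        if not batch:
            # Assumption of this sketch: flag rows without a batch identifier.
            raise ValueError(f"sample {row['sample_id']} has no batch identifier")
        batches[batch].append(row["sample_id"])
    return dict(batches)


groups = samples_per_batch(AGGR_CSV)
print(groups)
```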

• This Chemistry Batch Correction is specifically intended to correct for systematic variability in gene expression profiles caused by different versions of the Single Cell Gene Expression chemistry. 10x has tested and verified its effectiveness primarily on aggregating Single Cell Gene Expression v2 and v3 chemistry with well-matched input material. The module may be useful in other scenarios but will require careful validation of results.
• Chemistry batch correction affects the PCA, t-SNE visualization and clustering results. Values in the aggregated feature-barcode matrix are not adjusted by Chemistry Batch Correction. Differential expression analysis is still performed on the feature-barcode count matrix.
• We recommend using the batch effect score (described in the algorithm details) to compare the performance of batch correction. Besides the batch effect itself, the score also depends on the composition of the cell population across batches.
• When chemistry batch correction is enabled, FBPCA is used to perform dimensionality reduction instead of IRLBA PCA.
• The minimum System Requirements of 64GB RAM will allow batch correction on datasets with a total of 128k cells.

## Aggregating 5' and 3' Gene Expression Data

The cellranger aggr pipeline uses Chemistry Batch Correction when aggregating results from a combination of 5' and 3', or 3' v2 and 3' v3 Gene Expression data. Enabling Chemistry Batch Correction in this scenario improves the mixing of the batches in the t-SNE visualization and clustering results, so we recommend using it. However, residual batch effects may still be present, and we advise careful validation of the results. In particular, for the V(D)J genes, the 5' assay will generally count the V gene segments of the immune receptor (e.g. TRBV12-1 or IGH4-2), while the 3' assay will count the C gene segments (e.g. TRBC or IGHA), which may pose additional analysis challenges.

# Aggregating Targeted Gene Expression Data

The cellranger aggr pipeline can aggregate results that include Targeted Gene Expression analysis provided that the requirements above are met. Secondary analysis for all samples is done with the non-targeted genes excluded from the feature-barcode matrices. Aggregated feature-barcode matrices follow the same convention as Targeted Gene Expression analysis: the filtered feature-barcode matrices do not include non-targeted genes, whereas the raw feature-barcode matrices still include all genes.

# Command Line Interface

These are the most common command line arguments (run cellranger aggr --help for a full list):

| Argument | Description |
|---|---|
| --id=ID | A unique run ID string: e.g. AGG123 |
| --csv=CSV | Path of a CSV file containing a list of cellranger count outputs (see Setting Up an Aggregation CSV). |
| --normalize=MODE | (Optional) String specifying how to normalize depth across the input libraries. Valid values: mapped (default), or none (see Depth Normalization). |
| --nosecondary | (Optional) Add this flag to skip secondary analysis, which includes dimensionality reduction, clustering and visualization. This is applicable if you plan to use cellranger reanalyze or your own custom analysis. |

After specifying these input arguments, run cellranger aggr:

$ cd /home/jdoe/runs
$ cellranger aggr --id=AGG123 \
                  --csv=AGG123_libraries.csv \
                  --normalize=mapped


The pipeline will begin to run, creating a new folder named with the aggregation ID you specified (e.g. /home/jdoe/runs/AGG123) for its output. If this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it.
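If you launch aggregations from a script, the command line can be assembled programmatically. The sketch below only builds the argument list from the flags documented above and checks whether the output folder already exists; the function names `aggr_command` and `will_resume` are hypothetical, and nothing is executed.

```python
import shlex
from pathlib import Path

VALID_NORMALIZE = ("mapped", "none")


def aggr_command(run_id, csv_path, normalize="mapped", nosecondary=False):
    """Build the argv list for a cellranger aggr call using documented flags."""
    if normalize not in VALID_NORMALIZE:
        raise ValueError(f"normalize must be one of {VALID_NORMALIZE}")
    argv = [
        "cellranger", "aggr",
        f"--id={run_id}",
        f"--csv={csv_path}",
        f"--normalize={normalize}",
    ]
    if nosecondary:
        argv.append("--nosecondary")
    return argv


def will_resume(work_dir, run_id):
    """True if WORK_DIR/ID already exists, i.e. the case in which the text
    above says cellranger treats the folder as an existing pipestance."""
    return (Path(work_dir) / run_id).is_dir()


argv = aggr_command("AGG123", "AGG123_libraries.csv")
print(shlex.join(argv))
```

To actually start the run, the list could be passed to `subprocess.run(argv, cwd="/home/jdoe/runs", check=True)` on a machine where cellranger is installed.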

## Depth Normalization

When combining data from multiple GEM wells, the cellranger aggr pipeline automatically equalizes the average read depth per cell between groups before merging. This approach avoids artifacts that may be introduced due to differences in sequencing depth. It is possible to turn off normalization or change the way normalization is done. The none option may be appropriate if you want to maximize sensitivity and plan to deal with depth normalization in a downstream step.

There are two normalization modes:

• none: Do not normalize at all.
• mapped (default): For each library type, subsample reads from higher-depth GEM wells until they all have, on average, an equal number of reads per cell that are confidently mapped to the transcriptome (Gene Expression) or assigned to known features (Feature Barcode Technology). If Targeted Gene Expression libraries are included, then normalization is performed on the basis of average reads per cell mapped confidently to the targeted transcriptome. The subsampling rates for Targeted Gene Expression libraries are all multiplied by 2 (provided all samples can achieve that depth). This is consistent with sequencing depth recommendations and is also done to avoid removing large fractions of reads from targeted libraries whenever they are combined with whole transcriptome libraries.
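The arithmetic behind the mapped mode can be illustrated with a small calculation. This is a simplified sketch, not Cell Ranger's implementation: it assumes the target depth is the lowest per-well mean of confidently mapped reads per cell, handles a single library type, and ignores the multiplier for Targeted Gene Expression libraries. The depth values and the function name `subsampling_rates` are hypothetical.

```python
def subsampling_rates(mapped_reads_per_cell):
    """Fraction of reads to keep per GEM well so all wells reach equal depth.

    mapped_reads_per_cell: {gem_well: mean confidently mapped reads per cell}
    Assumption: every well is subsampled down to the shallowest well's depth.
    """
    if not mapped_reads_per_cell:
        raise ValueError("at least one GEM well is required")
    target = min(mapped_reads_per_cell.values())
    if target <= 0:
        raise ValueError("depths must be positive")
    return {well: target / depth for well, depth in mapped_reads_per_cell.items()}


# Hypothetical per-well depths (mean mapped reads per cell).
depths = {"LV123": 40_000, "LB456": 20_000, "LP789": 25_000}
rates = subsampling_rates(depths)
for well, rate in rates.items():
    print(f"{well}: keep {rate:.0%} of reads")
```

Under these assumptions the shallowest well (LB456) keeps all of its reads, while LV123 keeps half and LP789 keeps four fifths, so each well ends up at about 20,000 mapped reads per cell. With `--normalize=none`, no reads would be discarded.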

# Pipeline Outputs

The cellranger aggr pipeline generates output files that contain all of the data from the individual input jobs, aggregated into single output files, for convenient multi-sample analysis. The GEM well suffix of each barcode is updated to prevent barcode collisions, as described below.

Each output file produced by cellranger aggr follows the format described in the Understanding Output section of the documentation, but includes the union of all the relevant barcodes from each input job.

A successful run should conclude with a message similar to this:

Outputs:
- Aggregation metrics summary HTML:         /home/jdoe/runs/AGG123/outs/web_summary.html
- Aggregation metrics summary JSON:         /home/jdoe/runs/AGG123/outs/summary.json
- Secondary analysis output CSV:            /home/jdoe/runs/AGG123/outs/analysis
- Filtered feature-barcode matrices MEX:    /home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5:   /home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix.h5
- Unfiltered feature-barcode matrices MEX:  /home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix
- Unfiltered feature-barcode matrices HDF5: /home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix.h5
- Copy of the input aggregation CSV:        /home/jdoe/runs/AGG123/outs/aggregation.csv
- Loupe Browser file:                       /home/jdoe/runs/AGG123/outs/cloupe.cloupe

Pipestance completed successfully!


Once cellranger aggr has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the cellranger aggr section of the Summary Metrics page.

# Understanding GEM Wells

Each GEM well is a physically distinct set of GEM partitions, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, we append a small integer identifying the GEM well to the barcode nucleotide sequence, and use that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.

This number, which tells us which GEM well this barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells reflects the order in which the GEM wells were provided in the Aggregation CSV.
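The suffix scheme is easy to reproduce when you need to map aggregated barcodes back to their input samples. The sketch below is illustrative only: the helper names are hypothetical, and it assumes numbering starts at 1, as the -1 and -2 suffixes in the example above suggest.

```python
def gem_well_numbers(sample_ids):
    """Assign GEM well numbers in Aggregation CSV order (assumed to start at 1)."""
    return {sample_id: i for i, sample_id in enumerate(sample_ids, start=1)}


def add_suffix(sequence, gem_well):
    """Append the GEM well suffix to a barcode nucleotide sequence."""
    return f"{sequence}-{gem_well}"


def split_barcode(barcode):
    """Split 'SEQUENCE-N' into (sequence, GEM well number)."""
    sequence, _, suffix = barcode.rpartition("-")
    return sequence, int(suffix)


# sample_id column of the Aggregation CSV, in file order.
wells = gem_well_numbers(["LV123", "LB456", "LP789"])
sample_for_well = {number: sample_id for sample_id, number in wells.items()}

first = add_suffix("AGACCATTGAGACTTA", wells["LV123"])
second = add_suffix("AGACCATTGAGACTTA", wells["LB456"])
print(first, second)
print(sample_for_well[split_barcode(second)[1]])
```

The two barcodes share a nucleotide sequence but remain distinct keys, and the suffix alone is enough to recover which row of the Aggregation CSV a cell came from.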