Cell Ranger 7.1, printed on 03/04/2024
Many experiments generate data for multiple samples. Depending on the experimental design, these could be replicates from the same set of cells, cells from different tissues or time points from the same individual, or cells from different individuals. Samples could be processed through different Gel Bead-in Emulsion (GEM) wells or multiplexed within the same GEM well on Chromium instruments. The cellranger aggr pipeline can be used to aggregate samples from these scenarios into a single feature-barcode matrix.
cellranger aggr is not designed to aggregate multiple sequencing runs of the same library (e.g., resequencing the same library to increase read depth). Instead, for this case, specify all FASTQ files (fastqs field) in a single analysis of either cellranger count or multi.
For example, suppose you ran three count pipelines as follows:
$ cd /opt/runs
$ cellranger count --id=LV123 ...
... wait for pipeline to finish ...
$ cellranger count --id=LB456 ...
... wait for pipeline to finish ...
$ cellranger count --id=LP789 ...
... wait for pipeline to finish ...
These three runs can be aggregated with cellranger aggr to get a single feature-barcode matrix and secondary analysis outputs. In order to do so, next create an aggregation CSV file.
Create a CSV file with a header line containing the following columns:
sample_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it doesn't need to match any previous ID assigned to the GEM well.
molecule_h5: Path to the sample_molecule_info.h5 file produced by cellranger count or multi. For example, if you processed your GEM well by calling cellranger count --id=ID in some directory /DIR, this path would be
For Cell Ranger v6.0+ and Loupe Browser v5.1.0+, the libraries CSV header should be sample_id,molecule_h5. For prior software versions, it should be library_id,molecule_h5.
Either make the CSV file in a text editor, or create it in Excel and save as a CSV file. Continuing with the example from the previous section, the Excel spreadsheet would look like this:
When you save it as a CSV, the result would look like this:
sample_id,molecule_h5
LV123,/opt/runs/LV123/outs/molecule_info.h5
LB456,/opt/runs/LB456/outs/molecule_info.h5
LP789,/opt/runs/LP789/outs/molecule_info.h5
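The same file can also be generated programmatically. The following is a minimal Python sketch, assuming the three example runs live under /opt/runs and that each count run wrote its molecule file to outs/molecule_info.h5. The function name write_aggr_csv is hypothetical and not part of Cell Ranger.

```python
import csv
import io
from pathlib import PurePosixPath


def write_aggr_csv(handle, runs_dir, run_ids):
    """Write an aggregation CSV (sample_id,molecule_h5) for cellranger count runs."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_id", "molecule_h5"])
    for run_id in run_ids:
        # Assumed layout: <runs_dir>/<run_id>/outs/molecule_info.h5
        h5_path = PurePosixPath(runs_dir) / run_id / "outs" / "molecule_info.h5"
        writer.writerow([run_id, str(h5_path)])


# Write to an in-memory buffer here; pass an open file handle to write to disk.
buffer = io.StringIO()
write_aggr_csv(buffer, "/opt/runs", ["LV123", "LB456", "LP789"])
aggr_csv_text = buffer.getvalue()
print(aggr_csv_text, end="")
```

Here the run ID doubles as the sample_id, which is a convenience of this sketch; the sample_id is used for labeling only and can be any unique string.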
In addition to the CSV columns expected by cellranger aggr, you may optionally supply columns containing library metadata (e.g., lab or sample origin). These custom library annotations do not affect the analysis pipeline but can be visualized downstream in the Loupe Browser (see Categories section below). Note that unlike other CSV inputs to Cell Ranger, these custom columns may contain characters outside the ASCII range (e.g., non-Latin characters).
For large studies involving samples run on multiple GEM wells, run cellranger count on FASTQ data from each of the GEM wells individually, and then pool the results using cellranger aggr.
The cellranger aggr command takes an aggregation CSV file specifying the list of cellranger count output files (specifically the molecule_info.h5 from each run), and produces a single feature-barcode matrix containing all the data.
When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence (see GEM wells).
By default, reads from each GEM well are subsampled such that all GEM wells have the same effective sequencing depth, measured in terms of reads that are confidently mapped to the transcriptome or assigned to the feature IDs per cell. However, it is possible to change the depth normalization mode (see Depth Normalization).
If Feature Barcode analysis is included, the input Feature Reference CSV file should be the same for each GEM well.
Starting from Cell Ranger v7.0, CRISPR Guide Capture libraries from multiple GEM wells can be aggregated with cellranger aggr. There are no changes to aggr inputs – the presence of CRISPR Guide Capture library information in the molecule_info.h5 input files enables CRISPR aggregation. Normalization is enabled by default for both Gene Expression and CRISPR Guide Capture libraries; changes to normalization parameters affect both libraries. Note that protospacer calling is performed again on the combined data included in the cellranger aggr run. CRISPR aggregation generates the crispr_analysis/ folder in the outs/ directory. The structure of the crispr_analysis folder is similar to the CRISPR outputs from cellranger count.
After June 30, 2023, new Cell Ranger releases will no longer support Targeted Gene Expression analysis.
Targeted Gene Expression data is supported by cellranger aggr and can be aggregated with whole transcriptome Gene Expression data, provided that all GEM wells have matching chemistries and that the same target panel CSV file is used for all targeted samples.
The cellranger aggr command takes an aggregation CSV file specifying the list of cellranger multi per sample sample_molecule_info.h5 files, and performs aggregation on any combination of Gene Expression and Feature Barcode (Antibody or CRISPR Guide Capture, Cell Multiplexing) data that are present in the individual sample outputs.
Consider two per sample datasets containing data from one 3' Cell Multiplexing experiment:
$ cd /opt/runs
$ cellranger multi --id=Run1 ...
... wait for pipeline to finish ...
To aggregate the datasets, you need to create a CSV containing the following columns:
sample_id: Unique identifier for this sample. This will be used for labeling purposes only.
molecule_h5: Path to the per sample sample_molecule_info.h5 file generated by the cellranger multi pipeline. For example, if you processed Cell Multiplexing data by calling cellranger multi --id=ID in some directory /DIR, and the sample was called Sample1, this path would be
Apart from the change in the path to the per sample sample_molecule_info.h5 file, the documentation on additional columns for creating categories, depth normalization, and batch correction is the same.
Either make the CSV file in a text editor, or create it in Excel and save as a CSV file. The Excel spreadsheet might look like this:
The CSV file will look like this:
sample_id,molecule_h5
Sample1,/opt/runs/Run1/outs/per_sample_outs/Sample1/count/sample_molecule_info.h5
Sample2,/opt/runs/Run1/outs/per_sample_outs/Sample2/count/sample_molecule_info.h5
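The per sample paths follow a regular pattern, so they can be assembled rather than typed by hand. The sketch below assumes the directory layout shown in the example CSV (outs/per_sample_outs/<sample>/count/sample_molecule_info.h5). The helper name per_sample_molecule_h5 is hypothetical.

```python
from pathlib import PurePosixPath


def per_sample_molecule_h5(runs_dir, multi_id, sample_name):
    """Return the per sample molecule file path of a cellranger multi run."""
    # Assumed layout, taken from the example CSV in this section.
    return str(
        PurePosixPath(runs_dir) / multi_id / "outs" / "per_sample_outs"
        / sample_name / "count" / "sample_molecule_info.h5"
    )


rows = ["sample_id,molecule_h5"]
for sample in ["Sample1", "Sample2"]:
    rows.append(f"{sample},{per_sample_molecule_h5('/opt/runs', 'Run1', sample)}")
multi_csv_text = "\n".join(rows) + "\n"
print(multi_csv_text, end="")
```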
These are the most common command line arguments (run cellranger aggr --help for a full list):
--id: Required. A unique run ID string: e.g.
--csv: Required. Path to the aggregation CSV file containing the list of
--normalize: Optional. String specifying how to normalize depth across the input libraries. Valid values: mapped (default) or none (see Depth Normalization).
--nosecondary: Optional. Add this flag to skip secondary analysis, which includes dimensionality reduction, clustering, and visualization. This is applicable if you plan to use cellranger reanalyze or your own custom analysis.
After specifying these input arguments, run cellranger aggr:
cd /home/jdoe/runs
cellranger aggr --id=AGG123 \
                --csv=AGG123_libraries.csv
The pipeline will begin to run, creating a new output folder named with the specified aggregation ID (e.g. /home/jdoe/runs/AGG123). If this folder already exists, Cell Ranger will assume it is an existing pipestance and attempt to resume running it.
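Because an existing output folder causes a resume rather than a fresh run, it can be useful to check for it before launching. The following is a minimal sketch, assuming only the --id and --csv arguments from the example command. The function build_aggr_command is hypothetical, and the sketch only assembles the command without running it.

```python
from pathlib import Path


def build_aggr_command(run_id, csv_path, work_dir="."):
    """Return (argv, resumes) for a cellranger aggr invocation.

    resumes is True when <work_dir>/<run_id> already exists, in which case
    Cell Ranger would treat it as an existing pipestance and try to resume it.
    """
    argv = ["cellranger", "aggr", f"--id={run_id}", f"--csv={csv_path}"]
    resumes = (Path(work_dir) / run_id).exists()
    return argv, resumes


argv, resumes = build_aggr_command("AGG123", "AGG123_libraries.csv")
if resumes:
    print("Output folder AGG123 already exists; the run would be resumed.")
print(" ".join(argv))
# To launch from Python: subprocess.run(argv, cwd=work_dir, check=True)
```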
The minimum requirement of 64GB RAM will allow cellranger aggr to aggregate up to 250k cells, after which more memory will be required.
The cellranger aggr pipeline generates output files that contain all of the data from the individual input runs, aggregated into single output files, for convenient multi-sample analysis. Refer to the Understanding Outputs section to learn about aggr output files.
When combining multiple samples into a single dataset with the cellranger aggr pipeline, you can assign categories and values to individual samples by adding columns to the cellranger aggr input spreadsheet. These category assignments propagate into Loupe Browser, where you can view them in Categories Mode to help determine genes that are differentially expressed between samples. For example, the following spreadsheet was used to aggregate the acute myeloid leukemia (AML) tutorial dataset:
Any columns in addition to sample_id and molecule_h5 will be converted into categories, and the cells in each sample will be assigned to one of the values in that category.
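As a rough illustration of this rule, the Python sketch below reads an aggregation CSV and treats every column other than the two required ones as a category. The tissue and donor columns and their values are made up for illustration, and categories_by_sample is a hypothetical helper, not Cell Ranger or Loupe Browser code.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "molecule_h5")

# Hypothetical aggregation CSV; the "tissue" and "donor" columns and their
# values are invented for this example.
example_csv = """sample_id,molecule_h5,tissue,donor
LV123,/opt/runs/LV123/outs/molecule_info.h5,liver,D1
LB456,/opt/runs/LB456/outs/molecule_info.h5,blood,D1
LP789,/opt/runs/LP789/outs/molecule_info.h5,pancreas,D2
"""


def categories_by_sample(csv_text):
    """Map each category column to {sample_id: value}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    category_names = [c for c in reader.fieldnames if c not in REQUIRED_COLUMNS]
    categories = {name: {} for name in category_names}
    for row in reader:
        for name in category_names:
            # Every cell of this sample would carry this category value.
            categories[name][row["sample_id"]] = row[name]
    return categories


categories = categories_by_sample(example_csv)
print(categories["tissue"])
```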
If you are aggregating libraries generated by different chemistry versions of the Single Cell Gene Expression Reagents, you might observe systematic differences in gene expression profiles between libraries. The cellranger aggr pipeline incorporates batch effect correction (algorithm details) to overcome this. To enable batch correction, include the following column in your aggregation CSV file:
batch: Optional. Unique identifier for the batch that this GEM well belongs to. Libraries with the same batch identifier are considered to be in the same batch.
For example, if the LV123 sample in the previous example is a v2 library and the LB456 and LP789 samples are v3 libraries, set up the aggregation CSV file like this:
sample_id,molecule_h5,batch
LV123,/opt/runs/LV123/outs/molecule_info.h5,v2_lib
LB456,/opt/runs/LB456/outs/molecule_info.h5,v3_lib
LP789,/opt/runs/LP789/outs/molecule_info.h5,v3_lib
The v2_lib and v3_lib identifiers are example identifiers. Every sample from a given batch must have the same batch identifier, but otherwise the identifier text itself is arbitrary.
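Since batch membership is determined purely by matching identifier text, a typo silently creates a new batch. The sketch below groups samples by their batch value so the grouping can be inspected before running the pipeline. It is an illustrative check only, and samples_by_batch is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

aggr_csv = """sample_id,molecule_h5,batch
LV123,/opt/runs/LV123/outs/molecule_info.h5,v2_lib
LB456,/opt/runs/LB456/outs/molecule_info.h5,v3_lib
LP789,/opt/runs/LP789/outs/molecule_info.h5,v3_lib
"""


def samples_by_batch(csv_text):
    """Group sample IDs by the exact text of their batch identifier."""
    batches = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        batches[row["batch"]].append(row["sample_id"])
    return dict(batches)


batches = samples_by_batch(aggr_csv)
for batch_id, members in batches.items():
    print(batch_id, members)
```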
The cellranger aggr pipeline uses Chemistry Batch Correction when aggregating results from a combination of 5' and 3', or 3' v2 and 3' v3 Gene Expression data. Enabling Chemistry Batch Correction in this scenario improves the mixing of the batches in the t-SNE visualization and clustering results. We recommend using Chemistry Batch Correction for these scenarios; however, residual batch effects may still be present, and careful validation of the results is advised. In particular, for the V(D)J genes, the 5' Gene Expression assay will generally count the V gene segments of the immune receptor (e.g. TRBV12-1 or IGH4-2), while the 3' Gene Expression assay will count the C gene segments (e.g. TRBC or IGHA), which may pose additional analysis challenges.
The cellranger aggr pipeline can aggregate results that include Targeted Gene Expression analysis provided that the requirements above are met. Non-target genes are excluded from the feature-barcode matrices to conduct secondary analysis for all samples. Aggregated feature-barcode matrices follow the same convention as Targeted Gene Expression analysis: the filtered feature-barcode matrices exclude non-targeted genes, whereas the raw feature-barcode matrices include all genes.
The cellranger aggr pipeline supports the aggregation of Cell Multiplexing v3.1 data with Single Cell 3’ Gene Expression v3.1 data (i.e. "non-Cell Multiplexing data").
To combine Cell Multiplexing with non-Cell Multiplexing data, cellranger aggr (v6.0 and later) requires identical references (including feature/CMO reference). Therefore, non-Cell Multiplexing data must be re-run with the CMO reference file that specified Cell Multiplexing tags (CMOs) for the Cell Multiplexing data. These datasets can be run with either cellranger count (v3.0 and later) or cellranger multi (v5.0 and later). We recommend using the same version of Cell Ranger to generate inputs for cellranger aggr.
With cellranger count, specify the CMO reference with --feature-ref and use the --no-libraries option. The cellranger count pipeline will generate a molecule_info.h5, which can be used as input to the cellranger aggr pipeline. An example of the command is below (replace the example file paths with your own):
cellranger count --id=pbmc_1k_count \
   --transcriptome=/path/to/transcriptome/GRCh38-2020-A \
   --fastqs=/path/togex/fastqs/pbmc_1k_v3_fastqs/ \
   --sample=pbmc_1k_v3 \
   --feature-ref=/path/to/cmo-reference/cmo-ref.csv \
   --no-libraries
With cellranger multi, specify the CMO reference in the [feature] section of the multi config CSV file. An example of the config file is below (replace the example file paths with your own):
[libraries]
fastq_id,fastqs,feature_types
pbmc_1k_v3,/path/togex/fastqs/pbmc_1k_v3_fastqs/,Gene Expression
The cellranger multi pipeline generates a sample_molecule_info.h5 file, which is equivalent to the molecule_info.h5 file from a cellranger count run. These files can be used as input to the cellranger aggr pipeline. An example of the multi pipeline command is below (replace the example file paths with your own):
cellranger multi --id=pbmc_1k_multi --csv=/path/to/config.csv
The Cell Multiplexing protocol has additional wash steps that may lead to depletion of more ambient mRNA from samples compared to non-Cell Multiplexing samples. This difference in mRNA may then introduce small batch effects between Cell Multiplexing and non-Cell Multiplexing datasets. However, observations from internal data show fairly small batch effects, and batch effect correction is not needed in most cases. If the results for each sample reveal obvious differences between the Cell Multiplexing and non-Cell Multiplexing data, enabling chemistry batch correction may improve the mixing of the batches in the t-SNE visualization and clustering results.
If you are aggregating Single Cell 3' Gene Expression with Cell Multiplexing v3.1 samples with other chemistries (Single Cell 3’ v2, 5’ v2, etc.), the data can be treated similarly to Single Cell 3' v3.1 chemistry. For example, we recommend using chemistry batch correction when aggregating Single Cell 3’ v3.1 with v2 chemistry (see this article for more information about aggregating data from different chemistries).
When combining data from multiple GEM wells, the cellranger aggr pipeline automatically equalizes the average read depth per cell between groups before merging. This approach avoids artifacts that may be introduced due to differences in sequencing depth. It is possible to turn off normalization or change the way normalization is done. The none option may be appropriate if you want to maximize sensitivity and plan to handle depth normalization in a downstream step.
There are two normalization modes:
none: Do not normalize at all.
mapped (Default): For each library type, subsample reads from higher-depth GEM wells until they all have, on average, an equal number of reads per cell that are confidently mapped to the transcriptome (Gene Expression) or assigned to known features (Feature Barcode Technology). If Targeted Gene Expression libraries are included, then normalization is performed on the basis of average reads per cell mapped confidently to the targeted transcriptome. The subsampling rates for Targeted Gene Expression libraries are all multiplied by 2 (provided all samples can achieve that depth). This is consistent with sequencing depth recommendations and is also done to avoid removing large fractions of reads from targeted libraries whenever they are combined with whole transcriptome libraries.
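To make the mapped mode concrete, the sketch below computes the fraction of reads each GEM well would keep if every well were subsampled down to the lowest mean depth. This is a simplified illustration for a single library type that ignores the Targeted Gene Expression adjustment; the depth values are invented, and the actual pipeline computation may differ in detail.

```python
def mapped_subsampling_rates(mapped_reads_per_cell):
    """Return {gem_well: fraction of reads kept} under equal-depth subsampling.

    mapped_reads_per_cell maps each GEM well to its mean number of
    confidently mapped reads per cell.
    """
    # The lowest-depth well sets the target; it keeps all of its reads.
    target = min(mapped_reads_per_cell.values())
    return {well: target / depth for well, depth in mapped_reads_per_cell.items()}


# Hypothetical mean confidently mapped reads per cell for three GEM wells.
depths = {"LV123": 40000, "LB456": 20000, "LP789": 25000}
rates = mapped_subsampling_rates(depths)
for well, rate in rates.items():
    print(f"{well}: keep {rate:.0%} of reads")
```

In this example the lowest-depth well (LB456) is left untouched, while the deeper wells are subsampled to match it, which is why none can be preferable when sensitivity matters more than equal depth.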
Each GEM well is a physically distinct set of GEM partitions, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, we append a small integer identifying the GEM well to the barcode nucleotide sequence, and use that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.
This number, which indicates which GEM well the barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells will reflect the order that the GEM wells were provided in the Aggregation CSV.
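In downstream analysis, the suffix can be used to trace a barcode back to its input sample. The sketch below splits a barcode into its nucleotide sequence and suffix, then looks up the sample by row order in the aggregation CSV. It assumes suffixes start at 1 for the first row; the helper names are hypothetical.

```python
def split_barcode(barcode):
    """Split 'SEQUENCE-N' into (sequence, gem_well_suffix)."""
    sequence, _, suffix = barcode.rpartition("-")
    if not sequence or not suffix.isdigit():
        raise ValueError(f"barcode has no GEM well suffix: {barcode!r}")
    return sequence, int(suffix)


def sample_for_barcode(barcode, sample_ids):
    """Look up the sample_id for a barcode from the aggregation CSV row order.

    Assumption of this sketch: suffix 1 corresponds to the first CSV row.
    """
    _, suffix = split_barcode(barcode)
    if not 1 <= suffix <= len(sample_ids):
        raise ValueError(f"suffix {suffix} does not match any CSV row")
    return sample_ids[suffix - 1]


sample_ids = ["LV123", "LB456", "LP789"]  # order of rows in the aggregation CSV
print(split_barcode("AGACCATTGAGACTTA-2"))
print(sample_for_barcode("AGACCATTGAGACTTA-2", sample_ids))
```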