10x Genomics
Chromium Single Cell Gene Expression

Cell Ranger 7.1, printed on 03/29/2025

Customized Secondary Analysis with cellranger reanalyze

The cellranger reanalyze command reruns secondary analysis performed on the feature-barcode matrix (dimensionality reduction, clustering, and visualization) using different parameter settings.

Currently, cellranger reanalyze doesn't support the reanalysis of Feature Barcode data.

Command line interface
Output files
Parameters
Common use cases

Command line interface

These are the most common command line arguments (run cellranger reanalyze --help for a full list):

Argument	Description
`--id=ID`	A unique run ID string: e.g. `AGG123_reanalysis`
`--matrix=H5`	Path to a `filtered_feature_bc_matrix.h5` or `raw_feature_bc_matrix.h5` or `sample_filtered_feature_bc_matrix.h5` from a completed pipestance (from either `cellranger count`, `cellranger multi`, or `cellranger aggr`). Use the `raw_feature_bc_matrix.h5` when specifying a value for `--force-cells` that exceeds the original cell count.
`--params=CSV`	Optional. Path to a CSV file containing a list of valid parameters and the values to use for them (see Parameters).
`--agg=CSV`	Optional. Path to a CSV file that was used for `cellranger aggr`. This allows you to retain any metadata associated with the samples for display in Loupe Browser. This argument is required if you want to enable Chemistry Batch Correction in your reanalysis.
`--barcodes=list`	Optional. Path to a file containing a list of barcodes (one barcode per line) to use for reanalysis. The first line of the file must be a header called `Barcode`. The header is case sensitive. All barcodes must be present in the matrix.
`--genes=list`	Optional. Path to a file containing a list of gene IDs (one gene ID per line) to use for reanalysis (corresponding to the `gene_id` field of the reference GTF). The first line of the genes file must be a header called `Gene`. The header is case sensitive. All gene IDs included in the list must be present in the matrix. Only gene features are used in the secondary analysis. Feature Barcode features are ignored. An example `gene.txt` supplied to this flag (shown here on one line; in the file, each entry is on its own line): Gene ENSG00000243485 ENSG00000237613 ENSG00000186092 ENSG00000238009 ENSG00000239945 ENSG00000239906 ENSG00000241599 ENSG00000236601 ENSG00000284733 ENSG00000235146 ENSG00000284662
`--exclude-genes=list`	Optional. Path to a file containing a list of gene IDs (one gene ID per line) to exclude from reanalysis (corresponding to the `gene_id` field of the reference GTF). All gene IDs must be present in the matrix. The first line of the genes file must be a header called `Gene`. The exclusion can be applied with or without the `--genes` list. When a `--genes` list is supplied, `--exclude-genes` searches within the supplied list for exclusion. Note that only gene features are used in secondary analysis. Feature Barcode features are ignored.
`--force-cells=NUM`	Optional. Force pipeline to use this number of cells, bypassing the cell detection algorithm. Use this if the number of cells estimated by Cell Ranger is not consistent with the barcode rank plot. If specifying a value that exceeds the original cell count, you must use the `raw_feature_bc_matrix.h5`. Starting with Cell Ranger 6.1, it is no longer possible to run `--force-cells` on an `aggr` output in `reanalyze` with more cells than were originally called.
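
The list files accepted by `--barcodes`, `--genes`, and `--exclude-genes` share one layout: a case-sensitive header line followed by one entry per line. A minimal Python sketch for producing such a file is shown here; the helper name `write_id_list` is hypothetical and is not part of Cell Ranger.

```python
from pathlib import Path


def write_id_list(path, header, ids):
    """Write a one-column list file: header first, then one ID per line.

    `header` must be exactly "Barcode" (for --barcodes) or "Gene"
    (for --genes / --exclude-genes); both are case sensitive.
    """
    if header not in ("Barcode", "Gene"):
        raise ValueError("header must be exactly 'Barcode' or 'Gene'")
    lines = [header] + [str(entry).strip() for entry in ids]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines
```

For example, `write_id_list("gene.txt", "Gene", ["ENSG00000243485", "ENSG00000237613"])` writes a three-line file. The helper does not check that the IDs are present in the matrix; that requirement still has to be met by the caller.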

After specifying these input arguments, run cellranger reanalyze. In this example, we're reanalyzing the results of an aggregation named AGG123 from cellranger count outputs:

$ cd /home/jdoe/runs
$ ls -1 AGG123/outs/*.h5 # verify the input file exists
AGG123/outs/filtered_feature_bc_matrix.h5
AGG123/outs/filtered_molecules.h5
AGG123/outs/raw_feature_bc_matrix.h5
AGG123/outs/raw_molecules.h5
$ cellranger reanalyze --id=AGG123_reanalysis \
                       --matrix=AGG123/outs/filtered_feature_bc_matrix.h5 \
                       --params=AGG123_reanalysis.csv

The pipeline will begin to run, creating a new folder named with the reanalysis ID you specified (e.g. /home/jdoe/runs/AGG123_reanalysis) for its output. If this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it.
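
Because an existing output folder is treated as a pipestance to resume, it can be useful to check for one before launching. The following Python sketch assembles the command from the example above and reports whether the output folder already exists. The function name and its `runs_dir` argument are hypothetical; only the `--id`, `--matrix`, and `--params` flags come from this page, and the sketch does not launch anything itself.

```python
from pathlib import Path


def build_reanalyze_command(run_id, matrix, params=None, runs_dir="."):
    """Return (argv, folder_exists) for a cellranger reanalyze invocation.

    folder_exists is True when <runs_dir>/<run_id> is already present,
    in which case cellranger would try to resume that pipestance.
    """
    argv = ["cellranger", "reanalyze", f"--id={run_id}", f"--matrix={matrix}"]
    if params is not None:
        argv.append(f"--params={params}")  # optional parameter CSV
    folder_exists = (Path(runs_dir) / run_id).exists()
    return argv, folder_exists
```

The returned list can be passed to `subprocess.run` from the runs directory once you have decided whether resuming an existing pipestance is what you want.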

Output files

A successful run should conclude with a message similar to this:

2018-10-09 11:05:58 [runtime] (run:local)       ID.AGG123_reanalysis.SC_RNA_REANALYZER_CS.SC_RNA_ANALYZER.SUMMARIZE_ANALYSIS.fork0.join
2018-10-09 11:06:01 [runtime] (join_complete)   ID.AGG123_reanalysis.SC_RNA_REANALYZER_CS.SC_RNA_ANALYZER.SUMMARIZE_ANALYSIS
 
Outputs:
- Secondary analysis output CSV:          /home/jdoe/runs/AGG123_reanalysis/outs/analysis_csv
- Secondary analysis web summary:         /home/jdoe/runs/AGG123_reanalysis/outs/web_summary.html
- Copy of the input parameter CSV:        /home/jdoe/runs/AGG123_reanalysis/outs/params_csv.csv
- Copy of the input aggregation CSV:      /home/jdoe/runs/AGG123_reanalysis/outs/aggregation_csv.csv
- Loupe Browser file:                /home/jdoe/runs/AGG123_reanalysis/outs/cloupe.cloupe
- Filtered feature-barcode matrices MEX:  /home/jdoe/runs/AGG123_reanalysis/outs/filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5:  /home/jdoe/runs/AGG123_reanalysis/outs/filtered_feature_bc_matrix.h5
 
Pipestance completed successfully!

Refer to the Analysis page for an explanation of the secondary analyses outputs.

Parameters

The CSV file passed to --params should have 0 or more rows, one for every parameter that you want to customize. There is no header row. If a parameter is not specified in your CSV, its default value will be used. See Common Use Cases for some examples.
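
As an illustration of that layout, the following Python sketch writes a headerless two-column CSV from a dictionary. The helper name `write_params_csv` is hypothetical, and writing booleans as lowercase `true`/`false` is an assumption based on the range shown for `cbc_realign_panorama` in the table below.

```python
import csv


def write_params_csv(path, params):
    """Write one `name,value` row per parameter, with no header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for name, value in params.items():
            if isinstance(value, bool):
                # Assumption: booleans are spelled in lowercase.
                value = "true" if value else "false"
            writer.writerow([name, value])
```

Calling `write_params_csv("AGG123_reanalysis.csv", {"num_principal_comps": 50, "max_clusters": 30})` produces a two-row file. An empty dictionary produces an empty file, so every parameter keeps its default.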

Here is a detailed description of each parameter. For parameters that subset the data, a default value of null indicates that no subsetting happens by default.

Parameter	Type	Default Value	Recommended Range	Description
`num_analysis_bcs`	int	null	Cannot be set higher than the available number of cells.	Randomly subset data to N barcodes for all analysis. Reduce this parameter if you want to improve performance or simulate results from lower cell counts.
`num_pca_bcs`	int	null	Cannot be set higher than the available number of cells.	Randomly subset data to N barcodes when computing PCA projection (the most memory-intensive step). The PCA projection will still be applied to the full dataset, i.e. your final results will still reflect all the data. Try reducing this parameter if your analysis is running out of memory.
`num_pca_genes`	int	null	Cannot be set higher than the number of genes in the reference transcriptome.	Subset data to the top N genes (ranked by normalized dispersion) when computing PCA. Differential expression will still reflect all genes. Try reducing this parameter if your analysis is running out of memory.
`num_principal_comps`	int	10	10-100, depending on the number of cell populations / clusters you expect to see.	Compute N principal components for PCA. Setting this too high may cause spurious clusters to be called. The default value is 100 when the chemistry batch correction is enabled.
`cbc_knn`	int	10	5-20	Specify the number of nearest neighbors used to identify mutual nearest neighbors. Setting this too high will increase runtime and may cause an out-of-memory error. See Chemistry Batch Correction page for more details.
`cbc_alpha`	float	0.1	0.05-0.5	Specify the threshold of the percentage of matched cells between two batches, which is used to determine if the batch pair will be merged. See Chemistry Batch Correction page for more details.
`cbc_sigma`	float	150	10-500	Specify the bandwidth of the Gaussian smoothing kernel used to compute the correction vector for each cell. See Chemistry Batch Correction page for more details.
`cbc_realign_panorama`	bool	false	[true, false]	Specify if two batches will be merged if they are already in the same panorama. Setting this to True will usually improve the performance, but will also increase runtime and memory usage. See Chemistry Batch Correction page for more details.
`graphclust_neighbors`	int	0	10-500, depending on desired granularity	Number of nearest-neighbors to use in the graph-based clustering. Lower values result in higher-granularity clustering. The actual number of neighbors used is the maximum of this value and that determined by `neighbor_a` and `neighbor_b`. Set this value to zero to use those values instead.
`neighbor_a`	float	-230.0	Determines how clustering granularity scales with cell count.	The number of nearest neighbors, k, used in the graph-based clustering is computed as follows: k = neighbor_a + neighbor_b * log10(n_cells). The actual number of neighbors used is the maximum of this value and `graphclust_neighbors`.
`neighbor_b`	float	120.0	Determines how clustering granularity scales with cell count.	The number of nearest neighbors, k, used in the graph-based clustering is computed as follows: k = neighbor_a + neighbor_b * log10(n_cells). The actual number of neighbors used is the maximum of this value and `graphclust_neighbors`.
`max_clusters`	int	10	10-50, depending on the number of cell populations / clusters you expect to see.	Compute K-means clustering using K values of 2 to N. Setting this too high may cause spurious clusters to be called.
`tsne_input_pcs`	int	null	Cannot be set higher than the `num_principal_comps` parameter.	Subset to top N principal components for TSNE. Change this parameter if you want to see how the TSNE plot changes when using fewer PCs, independent of the clustering / differential expression. You may find that TSNE is faster and/or the output looks better when using fewer PCs.
`tsne_perplexity`	int	30	30-50	TSNE perplexity parameter (see the TSNE FAQ for more details). When analyzing 100k+ cells, increasing this parameter may improve TSNE results, but the algorithm will be slower.
`tsne_theta`	float	0.5	Must be between 0 and 1.	TSNE theta parameter (see the TSNE FAQ for more details). Higher values yield faster, more approximate results (and vice versa). The runtime and memory usage of TSNE will increase dramatically if you set this below 0.25.
`tsne_max_dims`	int	2	Must be 2 or 3.	Maximum number of TSNE output dimensions. Set this to 3 to produce both 2D and 3D TSNE projections (note: runtime will increase significantly).
`tsne_max_iter`	int	1000	1000-10000	Number of total TSNE iterations. Try increasing this if TSNE results do not look good on larger numbers of cells. Runtime increases linearly with number of iterations.
`tsne_stop_lying_iter`	int	250	Cannot be set higher than `tsne_max_iter`.	Iteration at which TSNE learning rate is reduced. Try increasing this if TSNE results do not look good on larger numbers of cells.
`tsne_mom_switch_iter`	int	250	Cannot be set higher than `tsne_max_iter`.	Iteration at which TSNE momentum is reduced. Try increasing this if TSNE results do not look good on larger numbers of cells.
`umap_input_pcs`	int	null	Cannot be set higher than the `num_principal_comps` parameter.	Subset to top N principal components for UMAP. Change this parameter if you want to see how the UMAP plot changes when using fewer PCs, independent of the clustering / differential expression. You may find that UMAP is faster and/or the output looks better when using fewer PCs.
`umap_n_neighbors`	int	30	[5, 50]	Determines the number of neighboring points used in local approximations of manifold structure. Larger values will usually result in more global structure at the loss of detailed local structure.
`umap_max_dims`	int	2	Must be 2 or 3.	Maximum number of UMAP output dimensions. Set this to 3 to produce both 2D and 3D UMAP projections.
`umap_min_dist`	float	0.3	[0.001, 0.5]	Controls how tightly the embedding is allowed to pack points together. Larger values distribute the embedded points more evenly, while smaller values make the embedding reflect the local structure more accurately.
`umap_metric`	string	correlation	list of supported metrics	Determines how the distance is computed in the input space.
`random_seed`	int	0	any integer	Random seed. Due to the randomized nature of the algorithms, changing this will produce slightly different results. If the TSNE or UMAP results don't look good, try running multiple times with different seeds and pick the TSNE or UMAP that looks best.
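
The interaction between `graphclust_neighbors`, `neighbor_a`, and `neighbor_b` in the table above can be worked through numerically. The following Python sketch evaluates k = neighbor_a + neighbor_b * log10(n_cells) and takes the maximum with `graphclust_neighbors`, as the table describes. The function name is hypothetical, and how Cell Ranger rounds or bounds the result is not stated on this page, so the sketch returns the unrounded value.

```python
import math


def graphclust_k(n_cells, graphclust_neighbors=0,
                 neighbor_a=-230.0, neighbor_b=120.0):
    """Unrounded neighbor count for graph-based clustering.

    Defaults mirror the parameter table; rounding is left to the caller
    because the page does not specify it.
    """
    scaled = neighbor_a + neighbor_b * math.log10(n_cells)
    return max(float(graphclust_neighbors), scaled)
```

With the defaults, 1,000 cells give k = -230 + 120 * 3 = 130 and 10,000 cells give k = -230 + 120 * 4 = 250. Setting `graphclust_neighbors` to 300 on the 10,000-cell dataset raises k to 300, while any value below 250 has no effect there.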

Common use cases

These examples illustrate what you should put in your --params CSV file in some common situations.

1. More principal components and clusters

For very large / diverse cell populations, the defaults may not capture the full variation between cells. In that case, try increasing the number of principal components and / or clusters. To run PCA with 50 components and k-means with up to 30 clusters, put this in your CSV:

num_principal_comps,50
max_clusters,30

2. Less memory usage

You can limit the memory usage of the analysis by computing the PCA projection on a subset of cells and genes. This is useful for large datasets (100k+ cells). If you have 100k cells, it's completely reasonable to use only 50% of them for PCA - the memory usage will be cut in half, but you'll still be well equipped to detect rare subpopulations. Limiting the number of genes will reduce memory even further. To compute the PCA projection using 50000 cells and 3000 genes, put this in your CSV:

num_pca_bcs,50000
num_pca_genes,3000

Note: Subsetting of cells is done randomly, to avoid bias. Subsetting of genes is done by binning genes by their mean expression across cells, then measuring the dispersion (a variance-like parameter) of each gene's expression normalized to the other genes in its bin.
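
The gene ranking described in the note can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cell Ranger's implementation: dispersion is taken as variance divided by mean, bins are equal-sized groups of genes sorted by mean expression, and normalization within a bin is a z-score. The function name, the `n_bins` argument, and the input layout are all hypothetical.

```python
from statistics import mean, pstdev, pvariance


def top_genes_by_normalized_dispersion(counts, n_top, n_bins=2):
    """Rank genes by dispersion normalized within mean-expression bins.

    counts: dict mapping gene ID -> list of per-cell expression values.
    """
    stats = []
    for gene, values in counts.items():
        mu = mean(values)
        if mu > 0:  # genes with no expression cannot be ranked
            stats.append((gene, mu, pvariance(values) / mu))

    # Bin genes by mean expression: sort, then cut into equal-sized groups.
    stats.sort(key=lambda item: item[1])
    bin_size = max(1, -(-len(stats) // n_bins))  # ceiling division

    scored = []
    for start in range(0, len(stats), bin_size):
        members = stats[start:start + bin_size]
        dispersions = [d for _, _, d in members]
        center, spread = mean(dispersions), pstdev(dispersions)
        for gene, _, d in members:
            # Assumption: z-score relative to the other genes in the bin.
            score = (d - center) / spread if spread > 0 else 0.0
            scored.append((gene, score))

    scored.sort(key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in scored[:n_top]]
```

Normalizing within bins keeps highly expressed genes from dominating the ranking simply because their raw dispersion is larger; a gene is selected for being unusually variable relative to genes of similar abundance.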
