
# Customized Secondary Analysis using cellranger reanalyze

The cellranger reanalyze command reruns secondary analysis performed on the gene-barcode matrix (dimensionality reduction, clustering and visualization) using different parameter settings.

## Command Line Interface

These are the most common command line arguments (run cellranger reanalyze --help for a full list):

| Argument | Description |
| --- | --- |
| --id=ID | A unique run ID string: e.g. AGG123_reanalysis |
| --matrix=H5 | Path to a filtered_gene_bc_matrices_h5.h5 from a completed pipestance (either cellranger count or cellranger aggr). |
| --params=CSV | Path to a CSV file containing a list of valid parameters and the values to use for them (see Parameters). |

After specifying these input arguments, run cellranger reanalyze. In this example, we're reanalyzing the results of an aggregation named AGG123:

$ cd /home/jdoe/runs
$ ls -1 AGG123/outs/*.h5 # verify the input file exists
AGG123/outs/filtered_gene_bc_matrices_h5.h5
AGG123/outs/filtered_molecules.h5
AGG123/outs/raw_gene_bc_matrices_h5.h5
AGG123/outs/raw_molecules.h5
$ cellranger reanalyze --id=AGG123_reanalysis \
                       --matrix=AGG123/outs/filtered_gene_bc_matrices_h5.h5 \
                       --params=AGG123_reanalysis.csv


The pipeline will begin to run, creating a new folder named with the reanalysis ID you specified (e.g. /home/jdoe/runs/AGG123_reanalysis) for its output. If this folder already exists, cellranger will assume it is an existing pipestance and attempt to resume running it.
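
Because an existing folder with the same ID changes the behavior from "start a new run" to "resume", it can be worth checking for it before launching. The following is a minimal Python sketch, not part of Cell Ranger: the helper names `build_reanalyze_command` and `will_resume` are hypothetical, and it assumes the output folder is created in the directory you launch from. Only the three arguments listed above are used.

```python
import os
import shlex


def build_reanalyze_command(run_id, matrix_h5, params_csv):
    """Return the argument list for `cellranger reanalyze` (hypothetical helper)."""
    return [
        "cellranger",
        "reanalyze",
        f"--id={run_id}",
        f"--matrix={matrix_h5}",
        f"--params={params_csv}",
    ]


def will_resume(run_id, workdir="."):
    """True if a folder named after the run ID already exists in workdir.

    Assumption: the pipeline writes its output folder into the directory
    it is launched from, as in the example above.
    """
    return os.path.isdir(os.path.join(workdir, run_id))


argv = build_reanalyze_command(
    "AGG123_reanalysis",
    "AGG123/outs/filtered_gene_bc_matrices_h5.h5",
    "AGG123_reanalysis.csv",
)
if will_resume("AGG123_reanalysis"):
    print("Folder exists: cellranger would try to resume that pipestance.")
# Print the shell-quoted command instead of running it.
print(shlex.join(argv))
```

To actually launch the run from Python, the same list could be passed to `subprocess.run(argv, check=True)`.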

## Pipeline Outputs

A successful run should conclude with a message similar to this:

2016-11-09 11:05:58 [runtime] (run:local)       ID.AGG123_reanalysis.SC_RNA_REANALYZER_CS.SC_RNA_ANALYZER.SUMMARIZE_ANALYSIS.fork0.join
2016-11-09 11:06:01 [runtime] (join_complete)   ID.AGG123_reanalysis.SC_RNA_REANALYZER_CS.SC_RNA_ANALYZER.SUMMARIZE_ANALYSIS

Outputs:
- Secondary analysis output CSV: /home/jdoe/runs/AGG123_reanalysis/outs/analysis_csv

Pipestance completed successfully!


Refer to the Analysis page for an explanation of the output.

## Parameters

The CSV file passed to --params should have 0 or more rows, one for every parameter that you want to customize. There is no header row. If a parameter is not specified in your CSV, its default value will be used. See Common Use Cases for some examples.
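
As an illustration of the file format just described (one `name,value` row per parameter, no header), here is a small Python sketch that writes such a file. The helper name `write_params_csv` is hypothetical and the file name is only an example; an empty dictionary produces an empty file, which corresponds to using every default.

```python
import csv


def write_params_csv(params, path):
    """Write one `name,value` row per parameter, with no header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for name, value in params.items():
            writer.writerow([name, value])


# Example: override two parameters and leave the rest at their defaults.
write_params_csv(
    {"num_principal_comps": 50, "max_clusters": 30},
    "AGG123_reanalysis.csv",
)
```

The resulting file can then be passed to `--params`.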

Here is a detailed description of each parameter. For parameters that subset the data, a default value of null indicates that no subsetting happens by default.

| Parameter | Type | Default Value | Recommended Range | Description |
| --- | --- | --- | --- | --- |
| num_analysis_bcs | int | null | Cannot be set higher than the available number of cells. | Randomly subset data to N barcodes for all analysis. Reduce this parameter if you want to improve performance or simulate results from lower cell counts. |
| num_pca_bcs | int | null | Cannot be set higher than the available number of cells. | Randomly subset data to N barcodes when computing PCA projection (the most memory-intensive step). The PCA projection will still be applied to the full dataset, i.e. your final results will still reflect all the data. Try reducing this parameter if your analysis is running out of memory. |
| num_pca_genes | int | null | Cannot be set higher than the number of genes in the reference transcriptome. | Subset data to the top N genes (ranked by normalized dispersion) when computing PCA. Differential expression will still reflect all genes. Try reducing this parameter if your analysis is running out of memory. |
| num_principal_comps | int | 10 | 10-100, depending on the number of cell populations / clusters you expect to see. | Compute N principal components for PCA. Setting this too high may cause spurious clusters to be called. |
| max_clusters | int | 10 | 10-50, depending on the number of cell populations / clusters you expect to see. | Compute kmeans clustering using k values of 2 to N. Setting this too high may cause spurious clusters to be called. |
| tsne_input_pcs | int | null | Cannot be set higher than the num_principal_comps parameter. | Subset to top N principal components for TSNE. Change this parameter if you want to see how the TSNE plot changes when using fewer PCs, independent of the clustering / differential expression. You may find that TSNE is faster and/or the output looks better when using fewer PCs. |
| tsne_perplexity | int | 30 | 30-50 | TSNE perplexity parameter (see [the TSNE FAQ](https://lvdmaaten.github.io/tsne/) for more details). When analyzing 100k+ cells, increasing this parameter may improve TSNE results, but the algorithm will be slower. |
| tsne_theta | float | 0.5 | Must be between 0 and 1. | TSNE theta parameter (see [the TSNE FAQ](https://lvdmaaten.github.io/tsne/) for more details). Higher values yield faster, more approximate results (and vice versa). The runtime and memory usage of TSNE will increase dramatically if you set this below 0.25. |
| tsne_dims | int | 2 | Must be 2 or 3. | Number of TSNE output dimensions. Set this to 3 to produce 3D TSNE plots (note: runtime will increase significantly). |
| tsne_max_iter | int | 1000 | 1000-10000 | Number of total TSNE iterations. Try increasing this if TSNE results do not look good on larger numbers of cells. Runtime increases linearly with number of iterations. |
| tsne_stop_lying_iter | int | 250 | Cannot be set higher than tsne_max_iter. | Iteration at which TSNE learning rate is reduced. Try increasing this if TSNE results do not look good on larger numbers of cells. |
| tsne_mom_switch_iter | int | 250 | Cannot be set higher than tsne_max_iter. | Iteration at which TSNE momentum is reduced. Try increasing this if TSNE results do not look good on larger numbers of cells. |
| random_seed | int | 0 | any integer | Random seed. Due to the randomized nature of the algorithms, changing this will produce slightly different results. If the TSNE results don't look good, try running multiple times with different seeds and pick the TSNE that looks best. |
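
Several of the constraints listed for these parameters depend on other parameters (for example, tsne_input_pcs is bounded by num_principal_comps). The sketch below shows one way to check a set of overrides before starting a run. It is a hypothetical pre-flight helper, not Cell Ranger's own validation: the name `check_params` is made up, the cell and gene counts must be supplied by you, and treating the tsne_theta bounds as inclusive is an assumption.

```python
# Defaults as listed in the parameter descriptions; None stands for "null".
DEFAULTS = {
    "num_analysis_bcs": None,
    "num_pca_bcs": None,
    "num_pca_genes": None,
    "num_principal_comps": 10,
    "max_clusters": 10,
    "tsne_input_pcs": None,
    "tsne_perplexity": 30,
    "tsne_theta": 0.5,
    "tsne_dims": 2,
    "tsne_max_iter": 1000,
    "tsne_stop_lying_iter": 250,
    "tsne_mom_switch_iter": 250,
    "random_seed": 0,
}


def check_params(overrides, num_cells=None, num_genes=None):
    """Return a list of problems found in `overrides` (empty if none)."""
    problems = [
        f"unknown parameter: {name}" for name in overrides if name not in DEFAULTS
    ]
    p = {**DEFAULTS, **overrides}

    def at_most(name, limit, label):
        # Skip the check when either side is unset (null / unknown).
        value = p[name]
        if value is not None and limit is not None and value > limit:
            problems.append(f"{name}={value} exceeds {label} ({limit})")

    at_most("num_analysis_bcs", num_cells, "available cells")
    at_most("num_pca_bcs", num_cells, "available cells")
    at_most("num_pca_genes", num_genes, "genes in the reference")
    at_most("tsne_input_pcs", p["num_principal_comps"], "num_principal_comps")
    at_most("tsne_stop_lying_iter", p["tsne_max_iter"], "tsne_max_iter")
    at_most("tsne_mom_switch_iter", p["tsne_max_iter"], "tsne_max_iter")

    # Assumption: the bounds 0 and 1 themselves are accepted.
    if not 0 <= p["tsne_theta"] <= 1:
        problems.append(f"tsne_theta={p['tsne_theta']} is not between 0 and 1")
    if p["tsne_dims"] not in (2, 3):
        problems.append(f"tsne_dims={p['tsne_dims']} must be 2 or 3")
    return problems


# tsne_input_pcs is checked against the default of 10 principal components.
print(check_params({"tsne_input_pcs": 20}))
```

Because overrides are merged with the defaults first, raising num_principal_comps in the same set of overrides makes a larger tsne_input_pcs pass the check.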

## Common Use Cases

These examples illustrate what you should put in your --params CSV file in some common situations.

### 1. More PCs and Clusters

For very large / diverse cell populations, the defaults may not capture the full variation between cells. In that case, try increasing the number of principal components and / or clusters. To run PCA with 50 components and k-means with up to 30 clusters, put this in your CSV:

num_principal_comps,50
max_clusters,30


### 2. Less Memory Usage

You can limit the memory usage of the analysis by computing the PCA projection on a subset of cells and genes. This is especially useful for large datasets (100k+ cells). If you have 100k cells, it's completely reasonable to use only 50% of them for PCA: the memory usage will be cut in half, but you'll still be well equipped to detect rare subpopulations. Limiting the number of genes will reduce memory even further. To compute the PCA projection using 50000 cells and 3000 genes, put this in your CSV:

num_pca_bcs,50000
num_pca_genes,3000


Note: Subsetting of cells is done randomly, to avoid bias. Subsetting of genes is done by binning genes by their mean expression across cells, then measuring the dispersion (a variance-like parameter) of each gene's expression normalized to the other genes in its bin.
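
To make the gene-selection idea in the note concrete, here is a rough Python sketch of ranking genes by dispersion normalized within mean-expression bins. It is not Cell Ranger's implementation. The specifics are assumptions chosen for illustration: dispersion is taken as the variance-to-mean ratio, genes are split into equal-sized bins after sorting by mean, and normalization is a z-score within each bin. The function name `top_dispersed_genes` is hypothetical.

```python
import statistics


def top_dispersed_genes(counts, n_top, n_bins=20):
    """Return the n_top genes with the highest bin-normalized dispersion.

    counts: dict mapping gene name -> list of per-cell expression values.
    """
    stats = []
    for gene, values in counts.items():
        mean = statistics.fmean(values)
        if mean == 0:
            continue  # skip genes with no expression at all
        # Assumed dispersion measure: variance-to-mean ratio.
        dispersion = statistics.pvariance(values) / mean
        stats.append((mean, dispersion, gene))

    # Sort by mean expression, then cut into roughly equal-sized bins.
    stats.sort()
    bin_size = max(1, -(-len(stats) // n_bins))  # ceiling division
    scored = []
    for start in range(0, len(stats), bin_size):
        members = stats[start:start + bin_size]
        dispersions = [d for _, d, _ in members]
        center = statistics.fmean(dispersions)
        spread = statistics.pstdev(dispersions) or 1.0  # avoid dividing by zero
        for _, d, gene in members:
            # Score each gene relative to the other genes in its bin.
            scored.append(((d - center) / spread, gene))

    scored.sort(reverse=True)
    return [gene for _, gene in scored[:n_top]]


example = {
    "low_flat": [1, 1, 1, 1],
    "low_noisy": [0, 2, 0, 2],
    "high_flat": [10, 10, 10, 10],
    "high_noisy": [0, 20, 0, 20],
}
print(top_dispersed_genes(example, n_top=2, n_bins=2))
```

Normalizing within bins is what lets a variable low-expression gene compete with a variable high-expression gene, even though their raw dispersions differ by an order of magnitude in this toy input.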