Cell Ranger ATAC 1.1, printed on 11/21/2024
The cellranger-atac reanalyze command reruns secondary analysis performed on the peak-barcode matrix (dimensionality reduction, clustering and visualization) using different parameter settings.
These are the most common command line arguments (run cellranger-atac reanalyze --help for a full list):
Argument | Description |
---|---|
--id=ID | A unique run ID string: e.g. AGG123_reanalysis |
--peaks=BED | Path to a peaks.bed from a completed pipestance (cellranger-atac count, cellranger-atac reanalyze or cellranger-atac aggr) or custom peaks. |
--fragments=TSV | Path to a block-gzipped TSV file of fragments from a completed pipestance (cellranger-atac count, cellranger-atac reanalyze or cellranger-atac aggr). A tabix (.tbi) index file of the same name is expected in the same directory; otherwise, specify the index with the optional --index argument. |
--reference=PATH | Path to a Cell Ranger ATAC reference. |
--params=CSV | (optional) Path to a CSV file containing a list of valid parameters and the values to use for them (see Parameters). |
--index=TBI | (optional) A tabix (.tbi) index corresponding to the input fragments file. Specify this if the filename differs from that of the fragments file, or if the fragments file and its index are located in different paths. |
--agg=CSV | (optional) Path to a CSV file that was used for cellranger-atac aggr. This allows you to retain any metadata associated with the samples for display in Loupe Cell Browser. |
--barcodes=CSV | (optional) Path to a CSV file in the singlecell.csv format with a list of cell-associated barcodes to use for reanalysis. All barcodes must be present in the fragments file. If this option is not provided, the pipeline performs cell calling. |
--force-cells=NUM | (optional) Force pipeline to use this number of cells during the cell detection algorithm. Use this if the number of cells estimated by Cell Ranger ATAC is not consistent with the barcode rank plot. |
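The rule for --fragments and --index can be checked before launching a run. The following is a minimal sketch, not part of Cell Ranger ATAC: the helper names `default_index_path` and `index_flag_needed` are hypothetical, and the only assumption taken from this page is that the default index is the fragments filename with `.tbi` appended, in the same directory.

```python
from pathlib import Path


def default_index_path(fragments):
    """Return the index path expected by default: the fragments path plus '.tbi'."""
    return Path(str(fragments) + ".tbi")


def index_flag_needed(fragments):
    """Return True when no index file sits next to the fragments file.

    In that case the index lives elsewhere (or under another name) and
    would have to be passed explicitly with --index.
    """
    return not default_index_path(fragments).is_file()


if __name__ == "__main__":
    fragments = "AGG123/outs/fragments.tsv.gz"  # example path from this page
    print(default_index_path(fragments))
    print("pass --index explicitly:", index_flag_needed(fragments))
```

If `index_flag_needed` returns True, either place the index next to the fragments file or add --index to the command line.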
After specifying these input arguments, run cellranger-atac reanalyze. In this example, we're reanalyzing the results of an aggregation named AGG123:

    $ cd /home/jdoe/runs
    $ ls -1 AGG123/outs/*.gz # verify the input file exists
    AGG123/outs/fragments.tsv.gz
    $ cellranger-atac reanalyze --id=AGG123_reanalysis \
                                --peaks=AGG123/outs/peaks.bed \
                                --params=AGG123_reanalysis.csv \
                                --reference=/home/jdoe/refs/hg19 \
                                --fragments=/home/jdoe/runs/AGG123/outs/fragments.tsv.gz

The pipeline will begin to run, creating a new folder named with the reanalysis ID you specified (e.g. /home/jdoe/runs/AGG123_reanalysis) for its output. If this folder already exists, cellranger-atac will assume it is an existing pipestance and attempt to resume running it.
A successful run should conclude with a message similar to this:

    2019-03-22 12:45:22 [runtime] (run:hydra) ID.AGG123_reanalysis.SC_ATAC_REANALYZER_CS.SC_ATAC_REANALYZER.CLOUPE_PREPROCESS.fork0.join
    2019-03-22 12:46:04 [runtime] (join_complete) ID.AGG123_reanalysis.SC_ATAC_REANALYZER_CS.SC_ATAC_REANALYZER.CLOUPE_PREPROCESS
    2019-03-22 12:46:04 [runtime] VDR killed 270 files, 18 MB.

    Outputs:
    - Summary of all data metrics: /home/jdoe/runs/AGG123_reanalysis/outs/summary.json
    - csv summarizing important metrics and values: /home/jdoe/runs/AGG123_reanalysis/outs/summary.csv
    - Per-barcode fragment counts & metrics: /home/jdoe/runs/AGG123_reanalysis/outs/singlecell.csv
    - Raw peak barcode matrix in hdf5 format: /home/jdoe/runs/AGG123_reanalysis/outs/raw_peak_bc_matrix.h5
    - Raw peak barcode matrix in mex format: /home/jdoe/runs/AGG123_reanalysis/outs/raw_peak_bc_matrix
    - Filtered peak barcode matrix in hdf5 format: /home/jdoe/runs/AGG123_reanalysis/outs/filtered_peak_bc_matrix.h5
    - Filtered peak barcode matrix in mex format: /home/jdoe/runs/AGG123_reanalysis/outs/filtered_peak_bc_matrix
    - Directory of analysis files: /home/jdoe/runs/AGG123_reanalysis/outs/analysis
    - HTML file summarizing aggregation analysis : /home/jdoe/runs/AGG123_reanalysis/outs/web_summary.html
    - Filtered tf barcode matrix in hdf5 format: /home/jdoe/runs/AGG123_reanalysis/outs/filtered_tf_bc_matrix.h5
    - Filtered tf barcode matrix in mex format: /home/jdoe/runs/AGG123_reanalysis/outs/filtered_tf_bc_matrix
    - Loupe Cell Browser input file: /home/jdoe/runs/AGG123_reanalysis/outs/cloupe.cloupe
    - Annotation of peaks with genes: /home/jdoe/runs/AGG123_reanalysis/outs/peak_annotation.tsv
    - Barcoded and aligned fragment file: /home/jdoe/runs/AGG123_reanalysis/outs/fragments.tsv.gz
    - Fragment file index: /home/jdoe/runs/AGG123_reanalysis/outs/fragments.tsv.gz.tbi

    Pipestance completed successfully!

Refer to the Analysis page for an explanation of the output.
The CSV file passed to --params should have 0 or more rows, one for every parameter that you want to customize. There is no header row. If a parameter is not specified in your CSV, its default value will be used. See Common Use Cases for some examples.
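As an illustration of that layout, the sketch below renders a headerless `name,value` CSV and rejects names that do not appear in the parameter table on this page. The function `render_params_csv` and the `KNOWN_PARAMS` set are hypothetical helpers written for this example; they are not part of Cell Ranger ATAC, and the pipeline performs its own validation.

```python
import csv
import io

# Parameter names copied from the parameter table on this page.
KNOWN_PARAMS = {
    "dim_reduce", "num_analysis_bcs", "num_dr_bcs", "num_dr_features",
    "num_comps", "graphclust_neighbors", "neighbor_a", "neighbor_b",
    "max_clusters", "tsne_input_pcs", "tsne_perplexity", "tsne_theta",
    "tsne_max_dims", "tsne_max_iter", "tsne_stop_lying_iter",
    "tsne_mom_switch_iter", "random_seed",
}


def render_params_csv(overrides):
    """Return CSV text with one `name,value` row per override and no header row."""
    unknown = sorted(set(overrides) - KNOWN_PARAMS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(unknown))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for name, value in overrides.items():
        writer.writerow([name, value])
    return buf.getvalue()


if __name__ == "__main__":
    # Write the text to the file that will be passed to --params.
    text = render_params_csv({"num_comps": 50, "max_clusters": 30})
    with open("AGG123_reanalysis.csv", "w", newline="") as handle:
        handle.write(text)
```

An empty dictionary yields an empty file, which matches the "0 or more rows" rule: every parameter then keeps its default.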
Here is a detailed description of each parameter. For parameters that subset the data, a default value of null indicates that no subsetting happens by default.
Note that some of the allowed parameters listed below (e.g. num_comps) are named slightly differently for cellranger-atac reanalyze than the corresponding parameters for cellranger reanalyze.
Parameter | Type | Default Value | Recommended Range | Description |
---|---|---|---|---|
dim_reduce | str | lsa | [lsa, pca, plsa] | Pick dimensionality reduction technique. |
num_analysis_bcs | int | null | Cannot be set higher than the available number of cells or lower than zero. | Randomly subset data to N barcodes for all analyses. Reduce this parameter if you want to improve performance or simulate results from lower cell counts. If set higher than the available number of cells, it is reset to that number. |
num_dr_bcs | int | null | Cannot be set higher than the available number of cells. | Randomly subset data to N barcodes when computing the LSA/PCA/PLSA projection (the most memory-intensive step). The projection will still be applied to the full dataset, i.e. your final results will still reflect all the data. Try reducing this parameter if your analysis is running out of memory. |
num_dr_features | int | null | Cannot be set higher than the number of peaks in the bed file. | Subset data to the top N features (that is, peaks, ranked by normalized dispersion) when computing the LSA/PCA/PLSA projection (the most memory-intensive step). The projection will still be applied to the full dataset, i.e. your final results will still reflect all the data. Try reducing this parameter if your analysis is running out of memory. |
num_comps | int | 15 | 10-100 (20 for PLSA), depending on the number of cell populations / clusters you expect to see. | Compute N principal components for LSA/PCA/PLSA. Setting this too high may cause spurious clusters to be called. |
graphclust_neighbors | int | 0 | 10-500, depending on desired granularity. | Number of nearest neighbors to use in the graph-based clustering. Lower values result in higher-granularity clustering. The actual number of neighbors used is the maximum of this value and the value determined by neighbor_a and neighbor_b. Set this value to zero to use the latter instead. |
neighbor_a | float | -230.0 | Determines how clustering granularity scales with cell count. | The number of nearest neighbors, k, used in the graph-based clustering is computed as follows: k = neighbor_a + neighbor_b * log10(n_cells). The actual number of neighbors used is the maximum of this value and graphclust_neighbors . |
neighbor_b | float | 120.0 | Determines how clustering granularity scales with cell count. | The number of nearest neighbors, k, used in the graph-based clustering is computed as follows: k = neighbor_a + neighbor_b * log10(n_cells). The actual number of neighbors used is the maximum of this value and graphclust_neighbors . |
max_clusters | int | 10 | 10-50, depending on the number of cell populations / clusters you expect to see. | Compute K-means clustering using K values of 2 to N. Setting this too high may cause spurious clusters to be called. |
tsne_input_pcs | int | null | Cannot be set higher than the num_comps parameter. | Subset to top N principal components for TSNE. Change this parameter if you want to see how the TSNE plot changes when using fewer PCs, independent of the clustering / differential expression. You may find that TSNE is faster and/or the output looks better when using fewer PCs. |
tsne_perplexity | int | 30 | 30-50 | TSNE perplexity parameter (see the TSNE FAQ for more details). When analyzing 100k+ cells, increasing this parameter may improve TSNE results, but the algorithm will be slower. |
tsne_theta | float | 0.5 | Must be between 0 and 1. | TSNE theta parameter (see the TSNE FAQ for more details). Higher values yield faster, more approximate results (and vice versa). The runtime and memory usage of TSNE will increase dramatically if you set this below 0.25. |
tsne_max_dims | int | 2 | Must be 2 or 3. | Maximum number of TSNE output dimensions. Set this to 3 to produce both 2D and 3D TSNE projections (note: runtime will increase significantly). |
tsne_max_iter | int | 1000 | 1000-10000 | Number of total TSNE iterations. Try increasing this if TSNE results do not look good on larger numbers of cells. Runtime increases linearly with number of iterations. |
tsne_stop_lying_iter | int | 250 | Cannot be set higher than tsne_max_iter . | Iteration at which TSNE learning rate is reduced. Try increasing this if TSNE results do not look good on larger numbers of cells. |
tsne_mom_switch_iter | int | 250 | Cannot be set higher than tsne_max_iter. | Iteration at which TSNE momentum is reduced. Try increasing this if TSNE results do not look good on larger numbers of cells. |
random_seed | int | 0 | any integer | Random seed. Due to the randomized nature of the algorithms, changing this will produce slightly different results. If the TSNE results don't look good, try running multiple times with different seeds and pick the TSNE that looks best. |
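The interaction between graphclust_neighbors, neighbor_a and neighbor_b described in the table can be worked through numerically. The sketch below only evaluates the formula k = neighbor_a + neighbor_b * log10(n_cells) and the "maximum of the two values" rule quoted above, with the table's defaults. The function name `effective_neighbors` is hypothetical, and any rounding or lower bound the pipeline applies internally is not documented here and is not modeled.

```python
import math


def effective_neighbors(n_cells, graphclust_neighbors=0,
                        neighbor_a=-230.0, neighbor_b=120.0):
    """Evaluate the documented rule for the number of nearest neighbors.

    k_scaled = neighbor_a + neighbor_b * log10(n_cells)
    result   = max(graphclust_neighbors, k_scaled)

    The value is returned unrounded; how the pipeline rounds or clamps it
    is an open assumption.
    """
    k_scaled = neighbor_a + neighbor_b * math.log10(n_cells)
    return max(graphclust_neighbors, k_scaled)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, effective_neighbors(n))
    # An explicit graphclust_neighbors wins only when it exceeds the scaled value.
    print(effective_neighbors(1000, graphclust_neighbors=200))
```

With the defaults, 1,000 cells give k = -230 + 120 * 3 = 130 and 10,000 cells give k = 250, so a graphclust_neighbors value below those numbers has no effect at those cell counts.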
These examples illustrate what you should put in your --params CSV file in some common situations.
For very large / diverse cell populations, the defaults may not capture the full variation between cells. In that case, try increasing the number of principal components and / or clusters. To run dimensionality reduction with 50 components and k-means with up to 30 clusters, put this in your CSV:

    num_comps,50
    max_clusters,30

You can limit the memory usage of the analysis by computing the LSA projection on a subset of cells and features. This is especially useful for large datasets (100k+ cells). If you have 100k cells, it's completely reasonable to use only 50% of them for LSA - the memory usage will be cut in half, but you'll still be well equipped to detect rare subpopulations. Limiting the number of features will reduce memory even further. To compute the LSA projection using 50000 cells and 3000 peaks, put this in your CSV:

    num_dr_bcs,50000
    num_dr_features,3000

Note: Subsetting of cells is done randomly, to avoid bias. Subsetting of features is done by binning features by their mean expression across cells, then measuring the dispersion (a variance-like parameter) of each gene's expression normalized to the other features in its bin.
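The feature-subsetting idea in the note can be illustrated with a small pure-Python sketch. It is not the pipeline's implementation: the function name `top_features_by_dispersion`, the use of variance / mean as the dispersion measure, the equal-sized bins, and the median / median-absolute-deviation normalization within each bin are all assumptions chosen to make the idea concrete.

```python
import statistics


def top_features_by_dispersion(counts, n_top, n_bins=2):
    """Rank features by dispersion normalized within bins of similar mean.

    counts: list of per-feature lists, one non-negative value per cell.
    Returns the indices of the n_top highest-ranked features.
    """
    means = [statistics.fmean(row) for row in counts]
    # Assumed dispersion measure: variance divided by mean.
    disp = [statistics.pvariance(row) / m if m > 0 else 0.0
            for row, m in zip(counts, means)]

    # Sort features by mean and cut the sorted order into equal-sized bins.
    order = sorted(range(len(counts)), key=lambda i: means[i])
    bin_size = max(1, -(-len(order) // n_bins))  # ceiling division

    normalized = [0.0] * len(counts)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        values = [disp[i] for i in members]
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
        for i in members:
            # Assumed normalization: distance from the bin median in MAD units.
            normalized[i] = (disp[i] - center) / spread if spread > 0 else 0.0

    ranked = sorted(range(len(counts)), key=lambda i: normalized[i], reverse=True)
    return ranked[:n_top]


if __name__ == "__main__":
    toy = [
        [1, 1, 1],     # low mean, no variation
        [0, 0, 3],     # low mean, variable
        [10, 10, 10],  # high mean, no variation
        [0, 10, 20],   # high mean, variable
    ]
    print(sorted(top_features_by_dispersion(toy, n_top=2)))
```

Binning by mean before normalizing keeps high-signal features from dominating the ranking simply because their raw dispersion is larger; in the toy data, the variable feature from each bin is selected.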