Cell Ranger ATAC 2.1, printed on 11/17/2024
The cellranger-atac pipeline requires FASTQ files as input. These typically come from running cellranger-atac mkfastq, a 10x Genomics-aware convenience wrapper for bcl2fastq. However, it is possible to use FASTQ files from other sources, such as bcl2fastq or BCL Convert from Illumina®, a published dataset, or the 10x Genomics bamtofastq tool. The following arguments specify which FASTQ files cellranger-atac should use:
Argument | Brief Description |
---|---|
--fastqs | Required. The folder containing the FASTQ files to be analyzed. Generally, this will be the fastq_path folder generated by cellranger-atac mkfastq. If the files are in multiple folders, for instance because one library was sequenced across multiple flow cells, supply a comma-separated list of paths. |
--sample | Optional. Sample name to analyze. This will be as specified in the sample sheet supplied to mkfastq or bcl2fastq. Multiple names may be supplied as a comma-separated list, in which case they will be treated as one sample. |
--lanes | Optional. Lanes associated with this sample. Defaults to all lanes. |
--indices | Deprecated/Optional. Only used for output from cellranger-atac demux. Sample indices associated with this sample. |
There is a wide range of ways bcl2fastq and mkfastq can be invoked, resulting in a wide range of potential file names and locations as the output. Since finding the right FASTQ files to process and the right arguments to process those files as desired can be confusing, some common scenarios are illustrated here.
Input FASTQ files should conform to the naming conventions of bcl2fastq and mkfastq. They are specified by providing the path to the folder containing them (via the --fastqs argument) and then optionally restricting the selection by specifying the samples and/or lanes of interest.
To assist users, this page illustrates examples of how to handle common scenarios involving different FASTQ file folder hierarchies or naming conventions.
For the Single Cell ATAC chemistry, the barcode is sequenced as part of the i5 index read. Both mkfastq and bcl2fastq conventionally associate R2 with the i5 index read, and R3 with read 2. Thus read 1, barcode, read 2, and sample index are associated with R1, R2, R3, and I1, respectively. This is reflected in the output files shown in the examples in this guide.
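
As a quick illustration of the read-type mapping described above, the following minimal sketch looks up the role of a FASTQ file from the read-type token in its name. The names `ATAC_READ_ROLES` and `describe_read` are hypothetical helpers written for this page, not part of Cell Ranger ATAC, and the sketch assumes the file follows the bcl2fastq naming convention used in the examples in this guide.

```python
# Role of each read type for Single Cell ATAC, as stated in the paragraph above.
ATAC_READ_ROLES = {
    "R1": "read 1",
    "R2": "barcode (i5 index read)",
    "R3": "read 2",
    "I1": "sample index",
}


def describe_read(filename: str) -> str:
    """Return the role of a FASTQ file (hypothetical helper).

    Assumes a bcl2fastq-style name, where the read type is the
    second-to-last underscore-separated token, for example
    test_sample1_S1_L001_R2_001.fastq.gz.
    """
    tokens = filename.split("_")
    if len(tokens) < 2 or tokens[-2] not in ATAC_READ_ROLES:
        raise ValueError(f"no recognized read type in {filename!r}")
    return ATAC_READ_ROLES[tokens[-2]]


print(describe_read("test_sample1_S1_L001_R2_001.fastq.gz"))
```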
Where are your FASTQ files?
How are they named?
How did I get here?
By running mkfastq with a simple CSV layout file or Illumina® Experiment Manager samplesheet, or by running bcl2fastq directly (with an IEM samplesheet) on a flow cell. If you ran mkfastq, your files will be in a (MKFASTQ_ID)/outs/fastq_path folder, and your file hierarchy probably looks something like this:
MKFASTQ_ID
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLC5BBXX
            |-- test_sample1
            |   |-- test_sample1_S1_L001_I1_001.fastq.gz
            |   |-- test_sample1_S1_L001_R1_001.fastq.gz
            |   |-- test_sample1_S1_L001_R2_001.fastq.gz
            |   |-- test_sample1_S1_L001_R3_001.fastq.gz
            |   |-- test_sample1_S1_L002_I1_001.fastq.gz
            |   |-- test_sample1_S1_L002_R1_001.fastq.gz
            |   |-- test_sample1_S1_L002_R2_001.fastq.gz
            |   |-- test_sample1_S1_L002_R3_001.fastq.gz
            |   |-- test_sample1_S1_L003_I1_001.fastq.gz
            |   |-- test_sample1_S1_L003_R1_001.fastq.gz
            |   |-- test_sample1_S1_L003_R2_001.fastq.gz
            |   `-- test_sample1_S1_L003_R3_001.fastq.gz
            |-- test_sample2
            |   |-- test_sample2_S2_L001_I1_001.fastq.gz
            |   |-- test_sample2_S2_L001_R1_001.fastq.gz
            |   |-- test_sample2_S2_L001_R2_001.fastq.gz
            |   |-- test_sample2_S2_L001_R3_001.fastq.gz
            |   |-- test_sample2_S2_L002_I1_001.fastq.gz
            |   |-- test_sample2_S2_L002_R1_001.fastq.gz
            |   |-- test_sample2_S2_L002_R2_001.fastq.gz
            |   |-- test_sample2_S2_L002_R3_001.fastq.gz
            |   |-- test_sample2_S2_L003_I1_001.fastq.gz
            |   |-- test_sample2_S2_L003_R1_001.fastq.gz
            |   |-- test_sample2_S2_L003_R2_001.fastq.gz
            |   `-- test_sample2_S2_L003_R3_001.fastq.gz
        |-- Reports
        |-- Stats
        |-- Undetermined_S0_L001_I1_001.fastq.gz
        ...
        `-- Undetermined_S0_L003_R2_001.fastq.gz
If you ran bcl2fastq directly, then the output root folder would be where fastq_path is in the hierarchy above.
"Expected sample name prefixes" means you have one set of FASTQ files per sample, prefixed with the name of the sample as it appears in the simple CSV layout file or IEM sample sheet. Other situations described later on this page deal with the presence of four separate sets of files (four "samples" from bcl2fastq's point of view) per single biological sample/library.
For more information on the naming conventions, please visit the Illumina® support site or refer to the bcl2fastq User Guide. The scenario where your files do not conform to the naming convention is described in a different section later on this page.
The table below describes the arguments you would pass into any analysis pipeline to target the right FASTQ files in this scenario. Be sure to substitute the capitalized text as appropriate. Also note that in most cases you will be passing a single sample into any given pipeline. Exceptions to this are described in the documentation for the individual pipelines. The "All Samples" entries in this table are provided for technical completeness.
Situation | Argument+Value |
---|---|
All samples (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path |
All samples (mkfastq), multiple flow cells | --fastqs=MKFASTQ_ID/outs/fastq_path1,MKFASTQ_ID/outs/fastq_path2 |
All samples (bcl2fastq direct) | --fastqs=/PATH/TO/bcl2fastq_output |
Process test_sample1 from all lanes (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=test_sample1 |
Process test_sample1 from lane 1 only (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=test_sample1 \ --lanes=1 |
Process test_sample1 and test_sample2 as a single merged sample (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=test_sample1,test_sample2 |
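
The rows above all follow one pattern: a required comma-separated list of folders, plus optional sample and lane filters. A minimal sketch of that pattern is shown below. The function `fastq_args` is a hypothetical helper, not part of Cell Ranger ATAC, and it assumes that multiple lanes are given as a comma-separated list in the same way as folders and sample names (the tables on this page only show a single lane).

```python
from typing import List, Optional, Sequence


def fastq_args(fastq_dirs: Sequence[str],
               samples: Optional[Sequence[str]] = None,
               lanes: Optional[Sequence[int]] = None) -> List[str]:
    """Assemble --fastqs/--sample/--lanes arguments (hypothetical helper)."""
    if not fastq_dirs:
        raise ValueError("--fastqs is required: supply at least one folder")
    # Multiple folders (e.g. one library across several flow cells) are comma-separated.
    args = ["--fastqs=" + ",".join(fastq_dirs)]
    if samples:
        # Multiple sample names are treated as one merged sample.
        args.append("--sample=" + ",".join(samples))
    if lanes:
        # Assumption: several lanes are also written as a comma-separated list.
        args.append("--lanes=" + ",".join(str(lane) for lane in lanes))
    return args


print(" ".join(fastq_args(["MKFASTQ_ID/outs/fastq_path"], ["test_sample1"], [1])))
```

Omitting `samples` and `lanes` reproduces the "All samples" rows, which pass only `--fastqs`.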
How did I get here?
It is likely that an input sample sheet was used that explicitly separated the four oligos in a 10x Genomics sample index set into four separate sample names. You may see a file hierarchy like this:
bcl2fastq_output
|-- HFLC5BBXX
    |-- SI-GA-A1_1
    |   |-- SI-GA-A1_1_S1_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_1_S1_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_1_S1_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_1_S1_L001_R3_001.fastq.gz
    |-- SI-GA-A1_2
    |   |-- SI-GA-A1_2_S2_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_2_S2_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_2_S2_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_2_S2_L001_R3_001.fastq.gz
    |-- SI-GA-A1_3
    |   |-- SI-GA-A1_3_S3_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_3_S3_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_3_S3_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_3_S3_L001_R3_001.fastq.gz
    |-- SI-GA-A1_4
    |   |-- SI-GA-A1_4_S4_L001_I1_001.fastq.gz
    |   |-- SI-GA-A1_4_S4_L001_R1_001.fastq.gz
    |   |-- SI-GA-A1_4_S4_L001_R2_001.fastq.gz
    |   `-- SI-GA-A1_4_S4_L001_R3_001.fastq.gz
|-- Reports
|-- Stats
|-- Undetermined_S0_L001_I1_001.fastq.gz
|-- Undetermined_S0_L001_R1_001.fastq.gz
|-- Undetermined_S0_L001_R2_001.fastq.gz
`-- Undetermined_S0_L001_R3_001.fastq.gz
You probably want to merge all samples from the SI-GA-A1 index into a single analysis. If you only run one index at a time, you will see fewer reads than expected, which may translate to lower coverage or a lower cell count than you expect for your experiment.
Situation | Argument+Value |
---|---|
All samples | --fastqs=MKFASTQ_ID/outs/fastq_path |
Process all SI-GA-A1 reads in a single analysis | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=SI-GA-A1_1,SI-GA-A1_2,SI-GA-A1_3,SI-GA-A1_4 |
Only process first sample index | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=SI-GA-A1_1 |
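
Typing the four per-oligo sample names by hand is error-prone, so the `--sample` value can be generated instead. The sketch below is a hypothetical helper (`split_index_samples` is not a Cell Ranger ATAC function) and assumes the sample sheet used the `_1` to `_4` suffixes shown in the hierarchy above; adjust it if your sample sheet named the four entries differently.

```python
def split_index_samples(index_set: str, n_oligos: int = 4) -> str:
    """Build the --sample value covering every oligo of one index set.

    Assumes per-oligo sample names of the form <index_set>_<n>,
    as in the example hierarchy (hypothetical helper).
    """
    return ",".join(f"{index_set}_{i}" for i in range(1, n_oligos + 1))


print("--sample=" + split_index_samples("SI-GA-A1"))
```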
How did I get here?
An Illumina® Experiment Manager-formatted samplesheet was used with either no entry or a blank entry for the Sample_Project column. Your hierarchy likely looks something like this:
fastq_path
|-- Reports
|-- Stats
|-- test_sample_S1_L001_I1_001.fastq.gz
|-- test_sample_S1_L001_R1_001.fastq.gz
|-- test_sample_S1_L001_R2_001.fastq.gz
|-- test_sample_S1_L001_R3_001.fastq.gz
|-- test_sample_S1_L002_I1_001.fastq.gz
|-- test_sample_S1_L002_R1_001.fastq.gz
|-- test_sample_S1_L002_R2_001.fastq.gz
|-- test_sample_S1_L002_R3_001.fastq.gz
|-- test_sample_S1_L003_I1_001.fastq.gz
|-- test_sample_S1_L003_R1_001.fastq.gz
|-- test_sample_S1_L003_R2_001.fastq.gz
|-- test_sample_S1_L003_R3_001.fastq.gz
|-- Undetermined_S0_L001_I1_001.fastq.gz
...
`-- Undetermined_S0_L003_R2_001.fastq.gz
This is fine; you would use the same arguments as if the FASTQs were organized into subfolders within the output folder.
Situation | Argument+Value |
---|---|
All samples (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path |
All samples (bcl2fastq direct) | --fastqs=/PATH/TO/bcl2fastq_output |
Process test_sample from all lanes (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=test_sample |
Process test_sample from lane 1 only (mkfastq) | --fastqs=MKFASTQ_ID/outs/fastq_path \ --sample=test_sample \ --lanes=1 |
How did I get here?
It is likely that FASTQ files have been transferred from either a mkfastq
or
bcl2fastq
run into another folder. They still retain the names assigned by
bcl2fastq
, which is a combination of sample name, sample order, lane, read type,
and chunk. Your file hierarchy may look like this:
PROJECT_FOLDER
|-- MySample_S1_L001_I1_001.fastq.gz
|-- MySample_S1_L001_R1_001.fastq.gz
|-- MySample_S1_L001_R2_001.fastq.gz
|-- MySample_S1_L001_R3_001.fastq.gz
|-- MySample_S1_L002_I1_001.fastq.gz
|-- MySample_S1_L002_R1_001.fastq.gz
|-- MySample_S1_L002_R2_001.fastq.gz
|-- MySample_S1_L002_R3_001.fastq.gz
This is fine; since the files are named according to the bcl2fastq standard, you would use the same arguments as if the FASTQs were organized into a flow cell folder or mkfastq output folder.
Situation | Argument+Value |
---|---|
All samples | --fastqs=/PATH/TO/PROJECT_FOLDER |
Process MySample from all lanes | --fastqs=/PATH/TO/PROJECT_FOLDER \ --sample=MySample |
Process MySample from lane 1 only | --fastqs=/PATH/TO/PROJECT_FOLDER \ --sample=MySample \ --lanes=1 |
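
To check which sample names and lanes are actually present in a folder before choosing `--sample` and `--lanes`, the bcl2fastq-style names can be split into the five parts listed above (sample name, sample order, lane, read type, chunk). The following is a minimal sketch under that naming assumption; `FASTQ_NAME`, `FastqName`, and `parse_fastq_name` are hypothetical names, and the regular expression is an illustration rather than the pattern Cell Ranger ATAC itself uses.

```python
import re
from typing import NamedTuple

# <sample>_S<order>_L<lane>_<read>_<chunk>.fastq.gz, e.g. MySample_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<order>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI]\d)_(?P<chunk>\d{3})\.fastq\.gz$"
)


class FastqName(NamedTuple):
    sample: str
    order: int
    lane: int
    read: str
    chunk: int


def parse_fastq_name(filename: str) -> FastqName:
    """Split a bcl2fastq-style FASTQ file name into its parts (hypothetical helper)."""
    m = FASTQ_NAME.match(filename)
    if m is None:
        raise ValueError(f"{filename!r} does not follow the bcl2fastq naming convention")
    return FastqName(m["sample"], int(m["order"]), int(m["lane"]), m["read"], int(m["chunk"]))


names = ["MySample_S1_L001_R1_001.fastq.gz", "MySample_S1_L002_R3_001.fastq.gz"]
parsed = [parse_fastq_name(n) for n in names]
print(sorted({p.sample for p in parsed}), sorted({p.lane for p in parsed}))
```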
How did I get here?
The 10x Genomics demux pipeline was used to demultiplex the flow cell instead of mkfastq. This pipeline has been deprecated, but you have still got a job to do. Your file hierarchy likely has many files in it, named as such:
demux_id
|-- BCL_PROCESSOR_CS
`-- outs
    |-- fastq_path
        |-- read-I1_si-AAAAAAAA_lane-001-chunk-001.fastq.gz
        ...
        |-- read-I1_si-TTTTTTTT_lane-002-chunk-001.fastq.gz
        |-- read-I1_si-X_lane-002-chunk-001.fastq.gz
        |-- read-RA_si-AAAAAAAA_lane-001-chunk-001.fastq.gz
        ...
        |-- read-RA_si-TTTTTTTT_lane-002-chunk-001.fastq.gz
        |-- read-RA_si-X_lane-002-chunk-001.fastq.gz
To ingest the correct FASTQ files from a demux run, you will need to know the 10x sample index or oligos associated with your sample. That will select the correct files from the sample indices in your folder:
Situation | Argument+Value |
---|---|
All samples | --fastqs=/PATH/TO/PROJECT_FOLDER |
Process sample associated with SI-GA-A1 | --fastqs=/PATH/TO/PROJECT_FOLDER \ --indices=SI-GA-A1 |
Process sample associated with SI-GA-A1, lane 1 only | --fastqs=/PATH/TO/PROJECT_FOLDER \ --indices=SI-GA-A1 \ --lanes=1 |
Process samples by sample index oligo | --fastqs=/PATH/TO/PROJECT_FOLDER \ --indices=AACCGTAA,CTAAACGG,GGTTTACT,TCGGCGTC |
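
To see which demux output files correspond to a given set of sample index oligos, the file names can be matched on their `si-` field. This is only a sketch of the idea, not how the pipeline resolves `--indices` internally; `DEMUX_NAME` and `files_for_oligos` are hypothetical names, and the pattern is inferred from the example file names shown above.

```python
import re
from typing import Iterable, List

# read-<read>_si-<oligo>_lane-<lane>-chunk-<chunk>.fastq.gz, as in the listing above
DEMUX_NAME = re.compile(
    r"^read-(?P<read>[A-Za-z0-9]+)_si-(?P<index>[A-Za-z]+)"
    r"_lane-(?P<lane>\d+)-chunk-(?P<chunk>\d+)\.fastq\.gz$"
)


def files_for_oligos(filenames: Iterable[str], oligos: Iterable[str]) -> List[str]:
    """Return the demux-style files whose sample index is one of `oligos`."""
    wanted = set(oligos)
    selected = []
    for name in filenames:
        m = DEMUX_NAME.match(name)
        if m and m["index"] in wanted:
            selected.append(name)
    return selected


listing = [
    "read-I1_si-AAAAAAAA_lane-001-chunk-001.fastq.gz",
    "read-RA_si-AAAAAAAA_lane-001-chunk-001.fastq.gz",
    "read-RA_si-TTTTTTTT_lane-002-chunk-001.fastq.gz",
    "read-RA_si-X_lane-002-chunk-001.fastq.gz",
]
for name in files_for_oligos(listing, ["AAAAAAAA"]):
    print(name)
```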
How did I get here?
It is likely that you received files that were processed through a proprietary LIMS, which employs its own naming conventions.
10x Genomics pipelines need files named in the bcl2fastq or demux convention in order to run properly. You will need to determine which file corresponds to which sample and which read type, likely by consulting your sequencing core or the individual who demultiplexed your flow cell.
It is highly likely that these files were initially processed with bcl2fastq, so you will need to rename the files in one of the following formats, once you track down their origin:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
Where Read Type is one of:
- I1: Dual index i7 read (optional)
- R1: Read 1
- R2: Dual index i5 read
- R3: Read 2

Alternatively, Cell Ranger ATAC will also accept ATAC FASTQs in this format:
- I1: Dual index i7 read (optional)
- R1: Read 1
- I2: Dual index i5 read
- R2: Read 2

After you have renamed those files into one of those formats, use the following arguments:
Situation | Argument+Value |
---|---|
All samples | --fastqs=/PATH/TO/PROJECT_FOLDER |
Process SAMPLENAME from all lanes | --fastqs=/PATH/TO/PROJECT_FOLDER \ --sample=SAMPLENAME |
Process SAMPLENAME from lane 1 only | --sample=SAMPLENAME \ --fastqs=/PATH/TO/PROJECT_FOLDER \ --lanes=1 |
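
Once you know which sample, lane, and read type each file holds, the target name can be generated rather than typed. The sketch below builds names in the first format described above (I1, R1, R2, R3); `bcl2fastq_name` is a hypothetical helper, and the fixed `S1` and `001` fields simply follow the pattern given on this page. Actually renaming the files (for example with `os.rename`) is left out, since the mapping from your original file names depends on your sequencing core's conventions.

```python
# Read types accepted in the first naming format described above.
READ_TYPES = {"I1", "R1", "R2", "R3"}


def bcl2fastq_name(sample: str, lane: int, read_type: str) -> str:
    """Build a file name of the form [Sample Name]_S1_L00[Lane]_[Read Type]_001.fastq.gz."""
    if read_type not in READ_TYPES:
        raise ValueError(f"read type must be one of {sorted(READ_TYPES)}, got {read_type!r}")
    if lane < 1:
        raise ValueError("lane numbers start at 1")
    # Lane is zero-padded to three digits, matching L001, L002, ... in the examples.
    return f"{sample}_S1_L{lane:03d}_{read_type}_001.fastq.gz"


print(bcl2fastq_name("SAMPLENAME", 1, "R3"))
```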