Cell Ranger ARC 1.0, printed on 10/13/2024
The cellranger-arc count pipeline requires ATAC and GEX FASTQ files as input, which typically come from running cellranger-arc mkfastq, a 10x-aware convenience wrapper for bcl2fastq. However, it is possible to use FASTQ files from other sources, such as Illumina's bcl2fastq, a published dataset, or our bamtofastq. Input FASTQ files must conform to the naming conventions of bcl2fastq and mkfastq for cellranger-arc count to complete successfully. These files are specified in a libraries CSV file, which is passed to the cellranger-arc count pipeline with the --libraries argument.
The cellranger-arc count pipeline can process data from one Multiome ATAC library and one Multiome GEX library, each of which could be sequenced on multiple flow cells. Multi-library analysis is not possible at this time.
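As a rough illustration of the libraries CSV described above, the following Python sketch writes a file with one Gene Expression row and one Chromatin Accessibility row. The column names and library type strings come from the tables on this page; the paths, sample names, and the helper name build_libraries_csv are placeholders invented for this example.

```python
import csv
import io


def build_libraries_csv(rows):
    """Return libraries CSV text for (fastqs, sample, library_type) tuples."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    for fastqs, sample, library_type in rows:
        writer.writerow([fastqs, sample, library_type])
    return buf.getvalue()


# Hypothetical paths and sample names; substitute your own.
libraries = build_libraries_csv([
    ("/PATH/TO/GEX_FASTQS", "gex_sample", "Gene Expression"),
    ("/PATH/TO/ATAC_FASTQS", "atac_sample", "Chromatin Accessibility"),
])
print(libraries)
```

The resulting text can be saved to a file and passed to cellranger-arc count with the --libraries argument.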
There are multiple ways bcl2fastq and mkfastq can be invoked, resulting in a wide range of potential file names and locations as output. Since finding the right FASTQ files to process and the right arguments to process those files as desired can be confusing, we will illustrate some common scenarios below.
How did I get here?
By running cellranger-arc mkfastq with a simple CSV layout file or Illumina Experiment Manager samplesheet, or by running bcl2fastq directly (with an IEM samplesheet) on a flow cell.
Your files will be in a (MKFASTQ_ID)/outs/fastq_path folder, and the file hierarchy may look similar to this:
    MKFASTQ_ID
    |-- MAKE_FASTQS_CS
    `-- outs
        |-- fastq_path
            |-- HFLC5BBXX
                |-- test_sample1
                |   |-- test_sample1_S1_L001_I1_001.fastq.gz
                |   |-- test_sample1_S1_L001_I2_001.fastq.gz
                |   |-- test_sample1_S1_L001_R1_001.fastq.gz
                |   |-- test_sample1_S1_L001_R2_001.fastq.gz
                |   |-- test_sample1_S1_L002_I1_001.fastq.gz
                |   |-- test_sample1_S1_L002_I2_001.fastq.gz
                |   |-- test_sample1_S1_L002_R1_001.fastq.gz
                |   |-- test_sample1_S1_L002_R2_001.fastq.gz
                |   |-- test_sample1_S1_L003_I1_001.fastq.gz
                |   |-- test_sample1_S1_L003_I2_001.fastq.gz
                |   |-- test_sample1_S1_L003_R1_001.fastq.gz
                |   `-- test_sample1_S1_L003_R2_001.fastq.gz
                |-- test_sample2
                |   |-- test_sample2_S2_L001_I1_001.fastq.gz
                |   |-- test_sample2_S2_L001_I2_001.fastq.gz
                |   |-- test_sample2_S2_L001_R1_001.fastq.gz
                |   |-- test_sample2_S2_L001_R2_001.fastq.gz
                |   |-- test_sample2_S2_L002_I1_001.fastq.gz
                |   |-- test_sample2_S2_L002_I2_001.fastq.gz
                |   |-- test_sample2_S2_L002_R1_001.fastq.gz
                |   |-- test_sample2_S2_L002_R2_001.fastq.gz
                |   |-- test_sample2_S2_L003_I1_001.fastq.gz
                |   |-- test_sample2_S2_L003_I2_001.fastq.gz
                |   |-- test_sample2_S2_L003_R1_001.fastq.gz
                |   `-- test_sample2_S2_L003_R2_001.fastq.gz
            |-- Reports
            |-- Stats
            |-- Undetermined_S0_L001_I1_001.fastq.gz
            ...
            `-- Undetermined_S0_L003_R2_001.fastq.gz
If you ran bcl2fastq directly, your file hierarchy may look similar to this:
    BCL2FASTQ_OUTPUT_DIR
    |-- HFLC5BBXX
        |-- test_sample1
        |   |-- test_sample1_S1_L001_I1_001.fastq.gz
        |   |-- test_sample1_S1_L001_I2_001.fastq.gz
        |   |-- test_sample1_S1_L001_R1_001.fastq.gz
        |   |-- test_sample1_S1_L001_R2_001.fastq.gz
        |   |-- test_sample1_S1_L002_I1_001.fastq.gz
        |   |-- test_sample1_S1_L002_I2_001.fastq.gz
        |   |-- test_sample1_S1_L002_R1_001.fastq.gz
        |   |-- test_sample1_S1_L002_R2_001.fastq.gz
        |   |-- test_sample1_S1_L003_I1_001.fastq.gz
        |   |-- test_sample1_S1_L003_I2_001.fastq.gz
        |   |-- test_sample1_S1_L003_R1_001.fastq.gz
        |   `-- test_sample1_S1_L003_R2_001.fastq.gz
        |-- test_sample2
        |   |-- test_sample2_S2_L001_I1_001.fastq.gz
        |   |-- test_sample2_S2_L001_I2_001.fastq.gz
        |   |-- test_sample2_S2_L001_R1_001.fastq.gz
        |   |-- test_sample2_S2_L001_R2_001.fastq.gz
        |   |-- test_sample2_S2_L002_I1_001.fastq.gz
        |   |-- test_sample2_S2_L002_I2_001.fastq.gz
        |   |-- test_sample2_S2_L002_R1_001.fastq.gz
        |   |-- test_sample2_S2_L002_R2_001.fastq.gz
        |   |-- test_sample2_S2_L003_I1_001.fastq.gz
        |   |-- test_sample2_S2_L003_I2_001.fastq.gz
        |   |-- test_sample2_S2_L003_R1_001.fastq.gz
        |   `-- test_sample2_S2_L003_R2_001.fastq.gz
    ...
You will have one set of FASTQ files per sample, prefixed with the name of the sample as it appears in the simple CSV layout file or IEM samplesheet.
For more information on the naming conventions, please visit Illumina's support site or refer to the bcl2fastq User Guide. The scenario where your files do not conform to the naming convention is described in a different section later on this page.
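The file names in the listings above combine a sample name, a sample order (S), a lane (L), a read type, and a chunk number. A minimal Python sketch of how such a name could be split into its parts is shown here. The regular expression is an assumption inferred from the example file names on this page, not an official specification, and parse_fastq_name is a hypothetical helper.

```python
import re

# Assumed pattern, inferred from the example names on this page:
# [sample]_S[order]_L[lane]_[read type]_[chunk].fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<order>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[IR]\d)_(?P<chunk>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into its parts, or return None if it does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["order"] = int(parts["order"])
    parts["lane"] = int(parts["lane"])
    return parts


print(parse_fastq_name("test_sample1_S1_L002_R2_001.fastq.gz"))
print(parse_fastq_name("reads_from_lims.fq.gz"))  # does not follow the convention
```

A name that returns None here would likely need renaming before it can be listed in a libraries CSV.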
The table below describes the line in the libraries CSV file you would use in the corresponding scenario. Be sure to substitute the capitalized text as appropriate. The "All Samples" entries in this table are provided for technical completeness.
Situation | Line in libraries CSV |
---|---|
All samples (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,,Gene Expression<br>... |
All samples (mkfastq), multiple flow cells | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_FLOWCELL1/outs/fastq_path,,Gene Expression<br>/PATH/TO/MKFASTQ_FLOWCELL2/outs/fastq_path,,Gene Expression<br>... |
All samples (bcl2fastq direct) | fastqs,sample,library_type<br>/PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Gene Expression<br>... |
Process test_sample1 (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Gene Expression<br>... |
Process test_sample1 and test_sample2 as a single merged sample (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Gene Expression<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample2,Gene Expression<br>... |
How did I get here?
An Illumina Experiment Manager-formatted samplesheet was used with either no entry or a blank entry for the Sample_Project column. Your hierarchy may look similar to this:
    fastq_path
    |-- Reports
    |-- Stats
    |-- test_sample_S1_L001_I1_001.fastq.gz
    |-- test_sample_S1_L001_I2_001.fastq.gz
    |-- test_sample_S1_L001_R1_001.fastq.gz
    |-- test_sample_S1_L001_R2_001.fastq.gz
    |-- test_sample_S1_L002_I1_001.fastq.gz
    |-- test_sample_S1_L002_I2_001.fastq.gz
    |-- test_sample_S1_L002_R1_001.fastq.gz
    |-- test_sample_S1_L002_R2_001.fastq.gz
    |-- test_sample_S1_L003_I1_001.fastq.gz
    |-- test_sample_S1_L003_I2_001.fastq.gz
    |-- test_sample_S1_L003_R1_001.fastq.gz
    |-- test_sample_S1_L003_R2_001.fastq.gz
    |-- Undetermined_S0_L001_I1_001.fastq.gz
    ...
    `-- Undetermined_S0_L003_R2_001.fastq.gz
This is fine; you would use the same arguments as if the FASTQs were organized into subfolders within the output folder.
Situation | Line in libraries CSV |
---|---|
All samples (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,,Gene Expression<br>... |
All samples (bcl2fastq direct) | fastqs,sample,library_type<br>/PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Gene Expression<br>... |
Process test_sample only (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample,Gene Expression<br>... |
How did I get here?
It is likely that FASTQ files have been transferred from either a mkfastq or bcl2fastq run into another folder. They still retain the names assigned by bcl2fastq, which are a combination of sample name, sample order, lane, read type, and chunk. Your file hierarchy may look like this:
    PROJECT_FOLDER
    |-- MySample_S1_L001_I1_001.fastq.gz
    |-- MySample_S1_L001_I2_001.fastq.gz
    |-- MySample_S1_L001_R1_001.fastq.gz
    |-- MySample_S1_L001_R2_001.fastq.gz
    |-- MySample_S1_L002_I1_001.fastq.gz
    |-- MySample_S1_L002_I2_001.fastq.gz
    |-- MySample_S1_L002_R1_001.fastq.gz
    |-- MySample_S1_L002_R2_001.fastq.gz
This is fine; since the files are named according to the bcl2fastq standard, you would use the same arguments as if the FASTQs were organized into a flow cell folder or mkfastq output folder.
Situation | Line in libraries CSV |
---|---|
All samples | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,,Gene Expression<br>... |
Process MySample only | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,MySample,Gene Expression<br>... |
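When files have been copied into a flat folder, it can be worth checking that no read files were left behind. The sketch below groups file names by sample and lane and reports any group that lacks R1 or R2. Treating R1 and R2 as required and the index reads as optional follows the read-type list given for Gene Expression files on this page; the regular expression and the helper name missing_reads are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed bcl2fastq-style pattern, inferred from the example names on this page.
NAME = re.compile(r"^(.+)_S\d+_L(\d{3})_([IR]\d)_\d{3}\.fastq\.gz$")
REQUIRED = {"R1", "R2"}  # I1 and I2 are listed as optional


def missing_reads(filenames):
    """Map (sample, lane) to the sorted list of required read types not found."""
    seen = defaultdict(set)
    for name in filenames:
        match = NAME.match(name)
        if match:
            sample, lane, read = match.groups()
            seen[(sample, lane)].add(read)
    return {
        key: sorted(REQUIRED - reads)
        for key, reads in seen.items()
        if REQUIRED - reads
    }


# Hypothetical folder listing in which lane 2 is missing its R2 file.
files = [
    "MySample_S1_L001_I1_001.fastq.gz",
    "MySample_S1_L001_R1_001.fastq.gz",
    "MySample_S1_L001_R2_001.fastq.gz",
    "MySample_S1_L002_R1_001.fastq.gz",
]
print(missing_reads(files))
```

An empty result means every sample and lane found in the listing has both required reads.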
How did I get here?
It is likely that you received files that were processed through a proprietary LIMS, which employs its own naming conventions.
10x pipelines require files to be named in the bcl2fastq convention in order to run properly. You will need to determine the corresponding sample and read type for each file, likely by consulting your sequencing core or the individual who demultiplexed your flow cell.
It is highly likely that these files were initially processed with bcl2fastq. Once you track down the origin of each file, rename the files in the following format:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
Where Read Type is one of:
I1: Dual index i7 read (optional)
I2: Dual index i5 read (optional)
R1: Read 1
R2: Read 2
After the files have been renamed in the specified format, you will use the following arguments:
Situation | Line in libraries CSV |
---|---|
All samples | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,,Gene Expression<br>... |
Process SAMPLENAME only | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,SAMPLENAME,Gene Expression<br>... |
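The renaming step for Gene Expression files can be sketched in Python as follows. The function builds a target name in the [Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz format and prints a rename plan without touching any files. The LIMS file names and the helper name target_name are invented for illustration; the mapping from each file to its sample, lane, and read type has to come from your sequencing core.

```python
VALID_READ_TYPES = ("I1", "I2", "R1", "R2")


def target_name(sample, lane, read_type):
    """Build [Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz."""
    if read_type not in VALID_READ_TYPES:
        raise ValueError(f"unknown read type: {read_type}")
    if not 1 <= lane <= 9:
        raise ValueError("the L00[Lane Number] template assumes a single-digit lane")
    return f"{sample}_S1_L00{lane}_{read_type}_001.fastq.gz"


# Hypothetical LIMS file names mapped to (sample, lane, read type).
lims_files = {
    "run42_libA_lane1_read1.fq.gz": ("SAMPLENAME", 1, "R1"),
    "run42_libA_lane1_read2.fq.gz": ("SAMPLENAME", 1, "R2"),
}

# Only print the plan; perform the actual renaming once the mapping is verified.
rename_plan = {old: target_name(*parts) for old, parts in lims_files.items()}
for old, new in rename_plan.items():
    print(f"{old} -> {new}")
```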
How did I get here?
By running cellranger-arc mkfastq with a simple CSV layout file or Illumina Experiment Manager samplesheet, or by running bcl2fastq directly (with an IEM samplesheet) on a flow cell.
Your files will be in a (MKFASTQ_ID)/outs/fastq_path folder, and your file hierarchy may look similar to this:
    MKFASTQ_ID
    |-- MAKE_FASTQS_CS
    `-- outs
        |-- fastq_path
            |-- HFLC5BBXX
                |-- test_sample1
                |   |-- test_sample1_S1_L001_I1_001.fastq.gz
                |   |-- test_sample1_S1_L001_R1_001.fastq.gz
                |   |-- test_sample1_S1_L001_R2_001.fastq.gz
                |   |-- test_sample1_S1_L001_R3_001.fastq.gz
                |   |-- test_sample1_S1_L002_I1_001.fastq.gz
                |   |-- test_sample1_S1_L002_R1_001.fastq.gz
                |   |-- test_sample1_S1_L002_R2_001.fastq.gz
                |   |-- test_sample1_S1_L002_R3_001.fastq.gz
                |   |-- test_sample1_S1_L003_I1_001.fastq.gz
                |   |-- test_sample1_S1_L003_R1_001.fastq.gz
                |   |-- test_sample1_S1_L003_R2_001.fastq.gz
                |   `-- test_sample1_S1_L003_R3_001.fastq.gz
                |-- test_sample2
                |   |-- test_sample2_S1_L001_I1_001.fastq.gz
                |   |-- test_sample2_S1_L001_R1_001.fastq.gz
                |   |-- test_sample2_S1_L001_R2_001.fastq.gz
                |   |-- test_sample2_S1_L001_R3_001.fastq.gz
                |   |-- test_sample2_S1_L002_I1_001.fastq.gz
                |   |-- test_sample2_S1_L002_R1_001.fastq.gz
                |   |-- test_sample2_S1_L002_R2_001.fastq.gz
                |   |-- test_sample2_S1_L002_R3_001.fastq.gz
                |   |-- test_sample2_S1_L003_I1_001.fastq.gz
                |   |-- test_sample2_S1_L003_R1_001.fastq.gz
                |   |-- test_sample2_S1_L003_R2_001.fastq.gz
                |   `-- test_sample2_S1_L003_R3_001.fastq.gz
            |-- Reports
            |-- Stats
            |-- Undetermined_S0_L001_I1_001.fastq.gz
            ...
            `-- Undetermined_S0_L003_R3_001.fastq.gz
If you ran bcl2fastq directly, your file hierarchy may look similar to this:
    BCL2FASTQ_OUTPUT_DIR
    |-- HFLC5BBXX
        |-- test_sample1
        |   |-- test_sample1_S1_L001_I1_001.fastq.gz
        |   |-- test_sample1_S1_L001_R1_001.fastq.gz
        |   |-- test_sample1_S1_L001_R2_001.fastq.gz
        |   |-- test_sample1_S1_L001_R3_001.fastq.gz
        |   |-- test_sample1_S1_L002_I1_001.fastq.gz
        |   |-- test_sample1_S1_L002_R1_001.fastq.gz
        |   |-- test_sample1_S1_L002_R2_001.fastq.gz
        |   |-- test_sample1_S1_L002_R3_001.fastq.gz
        |   |-- test_sample1_S1_L003_I1_001.fastq.gz
        |   |-- test_sample1_S1_L003_R1_001.fastq.gz
        |   |-- test_sample1_S1_L003_R2_001.fastq.gz
        |   `-- test_sample1_S1_L003_R3_001.fastq.gz
        |-- test_sample2
        |   |-- test_sample2_S1_L001_I1_001.fastq.gz
        |   |-- test_sample2_S1_L001_R1_001.fastq.gz
        |   |-- test_sample2_S1_L001_R2_001.fastq.gz
        |   |-- test_sample2_S1_L001_R3_001.fastq.gz
        |   |-- test_sample2_S1_L002_I1_001.fastq.gz
        |   |-- test_sample2_S1_L002_R1_001.fastq.gz
        |   |-- test_sample2_S1_L002_R2_001.fastq.gz
        |   |-- test_sample2_S1_L002_R3_001.fastq.gz
        |   |-- test_sample2_S1_L003_I1_001.fastq.gz
        |   |-- test_sample2_S1_L003_R1_001.fastq.gz
        |   |-- test_sample2_S1_L003_R2_001.fastq.gz
        |   `-- test_sample2_S1_L003_R3_001.fastq.gz
    ...
You will have one set of FASTQ files per sample, prefixed with the name of the sample as it appears in the simple CSV layout file or IEM samplesheet. Other situations described later on this page deal with the presence of four separate sets of files (four "samples" from bcl2fastq's point of view) per single biological sample/library.
For more information on the naming conventions, please visit Illumina's support site or refer to the bcl2fastq User Guide. The scenario where your files do not conform to the naming convention is described in a different section later on this page.
The table below describes the line in the libraries CSV file you would use in the corresponding scenario. Be sure to substitute the capitalized text as appropriate. The "All Samples" entries in this table are provided for technical completeness.
Situation | Line in libraries CSV |
---|---|
All samples (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,,Chromatin Accessibility<br>... |
All samples (mkfastq), multiple flow cells | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_FLOWCELL1/outs/fastq_path,,Chromatin Accessibility<br>/PATH/TO/MKFASTQ_FLOWCELL2/outs/fastq_path,,Chromatin Accessibility<br>... |
All samples (bcl2fastq direct) | fastqs,sample,library_type<br>/PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Chromatin Accessibility<br>... |
Process test_sample1 (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Chromatin Accessibility<br>... |
Process test_sample1 and test_sample2 as a single merged sample (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample1,Chromatin Accessibility<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample2,Chromatin Accessibility<br>... |
How did I get here?
It is likely that the input samplesheet used explicitly separated the four oligos in a 10x Genomics sample index set into four separate sample names. You may see a file hierarchy similar to this:
    bcl2fastq_output
    |-- HFLC5BBXX
        |-- SI-GA-A1_1
        |   |-- SI-GA-A1_1_S1_L001_I1_001.fastq.gz
        |   |-- SI-GA-A1_1_S1_L001_R1_001.fastq.gz
        |   |-- SI-GA-A1_1_S1_L001_R2_001.fastq.gz
        |   `-- SI-GA-A1_1_S1_L001_R3_001.fastq.gz
        |-- SI-GA-A1_2
        |   |-- SI-GA-A1_2_S2_L001_I1_001.fastq.gz
        |   |-- SI-GA-A1_2_S2_L001_R1_001.fastq.gz
        |   |-- SI-GA-A1_2_S2_L001_R2_001.fastq.gz
        |   `-- SI-GA-A1_2_S2_L001_R3_001.fastq.gz
        |-- SI-GA-A1_3
        |   |-- SI-GA-A1_3_S3_L001_I1_001.fastq.gz
        |   |-- SI-GA-A1_3_S3_L001_R1_001.fastq.gz
        |   |-- SI-GA-A1_3_S3_L001_R2_001.fastq.gz
        |   `-- SI-GA-A1_3_S3_L001_R3_001.fastq.gz
        |-- SI-GA-A1_4
        |   |-- SI-GA-A1_4_S4_L001_I1_001.fastq.gz
        |   |-- SI-GA-A1_4_S4_L001_R1_001.fastq.gz
        |   |-- SI-GA-A1_4_S4_L001_R2_001.fastq.gz
        |   `-- SI-GA-A1_4_S4_L001_R3_001.fastq.gz
    |-- Reports
    |-- Stats
    |-- Undetermined_S0_L001_I1_001.fastq.gz
    |-- Undetermined_S0_L001_R1_001.fastq.gz
    |-- Undetermined_S0_L001_R2_001.fastq.gz
    `-- Undetermined_S0_L001_R3_001.fastq.gz
You probably want to be able to merge all samples from the SI-GA-A1 index into a single analysis. If you only run one index at a time, you will see a smaller number of reads than expected, which may translate to lower than expected coverage or cell count for the experiment.
Situation | Line in libraries CSV |
---|---|
All samples (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,,Chromatin Accessibility<br>... |
Process all SI-GA-A1 reads in a single analysis | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_1,Chromatin Accessibility<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_2,Chromatin Accessibility<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_3,Chromatin Accessibility<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_4,Chromatin Accessibility<br>... |
Only process first sample index | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,SI-GA-A1_1,Chromatin Accessibility<br>... |
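The four rows needed to merge a split sample index set can be generated rather than typed by hand. The Python sketch below assumes the four sample names follow the SI-GA-A1_1 through SI-GA-A1_4 pattern shown above; merged_index_rows is a hypothetical helper, and the path is the same placeholder used in the table.

```python
def merged_index_rows(fastq_path, index_set, library_type, n_oligos=4):
    """Return one libraries CSV row per oligo of a sample index set."""
    return [
        f"{fastq_path},{index_set}_{i},{library_type}"
        for i in range(1, n_oligos + 1)
    ]


lines = ["fastqs,sample,library_type"] + merged_index_rows(
    "/PATH/TO/MKFASTQ_ID/outs/fastq_path",  # placeholder path, substitute your own
    "SI-GA-A1",
    "Chromatin Accessibility",
)
print("\n".join(lines))
```

Listing all four names under the same library type lets the pipeline treat their reads as one library.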
How did I get here?
An Illumina Experiment Manager-formatted samplesheet was used with either no entry or a blank entry for the Sample_Project column. Your hierarchy may look similar to this:
    fastq_path
    |-- Reports
    |-- Stats
    |-- test_sample_S1_L001_I1_001.fastq.gz
    |-- test_sample_S1_L001_R1_001.fastq.gz
    |-- test_sample_S1_L001_R2_001.fastq.gz
    |-- test_sample_S1_L001_R3_001.fastq.gz
    |-- test_sample_S1_L002_I1_001.fastq.gz
    |-- test_sample_S1_L002_R1_001.fastq.gz
    |-- test_sample_S1_L002_R2_001.fastq.gz
    |-- test_sample_S1_L002_R3_001.fastq.gz
    |-- test_sample_S1_L003_I1_001.fastq.gz
    |-- test_sample_S1_L003_R1_001.fastq.gz
    |-- test_sample_S1_L003_R2_001.fastq.gz
    |-- test_sample_S1_L003_R3_001.fastq.gz
    |-- Undetermined_S0_L001_I1_001.fastq.gz
    ...
    `-- Undetermined_S0_L003_R3_001.fastq.gz
This is fine; you would use the same arguments as if the FASTQs were organized into subfolders within the output folder.
Situation | Line in libraries CSV |
---|---|
All samples (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,,Chromatin Accessibility<br>... |
All samples (bcl2fastq direct) | fastqs,sample,library_type<br>/PATH/TO/BCL2FASTQ_OUTPUT_DIR,,Chromatin Accessibility<br>... |
Process test_sample only (mkfastq) | fastqs,sample,library_type<br>/PATH/TO/MKFASTQ_ID/outs/fastq_path,test_sample,Chromatin Accessibility<br>... |
How did I get here?
It is likely that FASTQ files have been transferred from either a mkfastq or bcl2fastq run into another folder. They still retain the names assigned by bcl2fastq, which are a combination of sample name, sample order, lane, read type, and chunk. Your file hierarchy may look similar to this:
    PROJECT_FOLDER
    |-- MySample_S1_L001_I1_001.fastq.gz
    |-- MySample_S1_L001_I2_001.fastq.gz
    |-- MySample_S1_L001_R1_001.fastq.gz
    |-- MySample_S1_L001_R2_001.fastq.gz
    |-- MySample_S1_L002_I1_001.fastq.gz
    |-- MySample_S1_L002_I2_001.fastq.gz
    |-- MySample_S1_L002_R1_001.fastq.gz
    |-- MySample_S1_L002_R2_001.fastq.gz
This is fine; since the files are named according to the bcl2fastq standard, you would use the same arguments as if the FASTQs were organized into a flow cell folder or mkfastq output folder.
Situation | Line in libraries CSV |
---|---|
All samples | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,,Chromatin Accessibility<br>... |
Process MySample only | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,MySample,Chromatin Accessibility<br>... |
How did I get here?
It is likely that you received files that were processed through a proprietary LIMS, which employs its own naming conventions.
10x pipelines require files to be named in the bcl2fastq convention in order to run properly. You will need to determine the corresponding sample and read type for each file, likely by consulting your sequencing core or the individual who demultiplexed your flow cell.
It is highly likely that these files were initially processed with bcl2fastq, so once you track down their origin, you will need to rename the files in the following format:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
Where Read Type is one of:
I1: Dual index i7 read (optional)
R1: Read 1
R2: Dual index i5 read (optional)
R3: Read 2
After you have renamed the files into that format, you will use the following arguments:
Situation | Line in libraries CSV |
---|---|
All samples | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,,Chromatin Accessibility<br>... |
Process SAMPLENAME only | fastqs,sample,library_type<br>/PATH/TO/PROJECT_FOLDER,SAMPLENAME,Chromatin Accessibility<br>... |