Cell Ranger ARC 2.0, printed on 11/03/2024
Cell Ranger ARC provides pre-built human (GRCh38) and mouse (mm10) reference packages for use with cellranger-arc count. To create and use a custom reference package, Cell Ranger ARC requires a reference genome sequence (FASTA file) and gene annotations (GTF file). Optionally, transcription factor motifs can be specified in JASPAR format.
Cell Ranger ARC supports the use of customer-generated references under the following conditions:
_). The aggr pipeline reads characters after an _ as a species identifier. If that is not the case, an aggr run error occurs.
We outline the steps to create a reference package starting from a reference genome and a set of gene annotations.
FASTA and GTF files can be downloaded from sites like Ensembl and UCSC. The downloaded files are typically compressed and must be uncompressed before they can be processed in subsequent steps. As noted in the STAR manual, the most comprehensive genome sequence and annotations are recommended.
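The decompression step can be scripted. The following is a minimal sketch that assumes the downloads are gzip-compressed; the helper name gunzip_file and the file names in the usage comment are hypothetical and are not part of Cell Ranger ARC.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src: Path, dst: Path) -> Path:
    """Stream-decompress a gzip file to dst without loading it into memory."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst


# Hypothetical usage; substitute the names of the files you downloaded:
# gunzip_file(Path("assembly.fa.gz"), Path("assembly.fa"))
# gunzip_file(Path("annotation.gtf.gz"), Path("annotation.gtf"))
```

Streaming with shutil.copyfileobj keeps memory use flat, which matters for a multi-gigabyte genome FASTA.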
This step is optional. Any gene contained in the GTF file will end up in the final count matrix and analysis. If a GTF contains a low-confidence gene annotation that overlaps a high-confidence protein-coding gene, the pipeline cannot uniquely associate a UMI from the overlapping region with either gene, and that UMI count is "wasted". Similarly, on the ATAC side, filtering the GTF could make the peaks more interpretable. GTF files downloaded from sites like Ensembl and UCSC often contain transcripts and genes that need to be filtered from your final annotation. Examples of filters include:
Restricting to one or more classes of genes: GTF files often contain a field like gene_biotype or gene_type labelling a gene class as protein-coding or lincRNA, etc.
Removing genes from the Pseudo-Autosomal Region
Removing low-confidence transcripts
See the filters used for the pre-built GRCh38 and mm10 references.
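Before choosing filters, it can help to see which gene classes a GTF actually contains. The sketch below tallies the values of one attribute key over rows whose feature type is gene. It assumes the GTF has such rows (Ensembl-style files typically do) and uses a simple regular expression for key "value" pairs. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches attribute pairs of the form: key "value"
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(column: str) -> Dict[str, str]:
    """Parse the ninth GTF column; a repeated key keeps its last value."""
    return dict(ATTRIBUTE_RE.findall(column))


def count_gene_attribute(lines: Iterable[str], key: str = "gene_biotype") -> Counter:
    """Count the values of `key` across 'gene' rows, skipping comment lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        value = parse_attributes(fields[8]).get(key)
        counts[value if value is not None else "<missing>"] += 1
    return counts


# Hypothetical usage:
# with open("input.gtf") as handle:
#     for value, n in count_gene_attribute(handle).most_common():
#         print(value, n)
```

If the file uses gene_type instead of gene_biotype, pass key="gene_type".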
Cell Ranger ARC provides mkgtf, a simple utility to filter genes based on their key-value pairs in the GTF attribute column.
$ cellranger-arc mkgtf input.gtf output.gtf --attribute=key:allowable_value
For example, the following filtering will restrict a GRCh38 Ensembl GTF to genes of type protein-coding, lincRNA, antisense, and immune-related:
$ cellranger-arc mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lincRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
This command will generate a filtered GTF file Homo_sapiens.GRCh38.ensembl.filtered.gtf from the original unfiltered GTF file Homo_sapiens.GRCh38.ensembl.gtf. In the output file, other biotypes such as gene_biotype:pseudogene are excluded from the GTF annotation.
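When the list of allowed biotypes is long, the mkgtf argument list can be assembled programmatically instead of typed by hand. This is a small sketch that uses only the command form shown above (two positional file names followed by repeated --attribute=key:value options). The helper name mkgtf_command is hypothetical, and the sketch only builds the command; it does not run it.

```python
import shlex
from typing import List, Sequence


def mkgtf_command(
    input_gtf: str,
    output_gtf: str,
    allowed_values: Sequence[str],
    key: str = "gene_biotype",
) -> List[str]:
    """Build the argument list for cellranger-arc mkgtf."""
    args = ["cellranger-arc", "mkgtf", input_gtf, output_gtf]
    args += [f"--attribute={key}:{value}" for value in allowed_values]
    return args


command = mkgtf_command(
    "input.gtf", "output.gtf", ["protein_coding", "lincRNA", "antisense"]
)
# Print a copy-pasteable shell line; pass `command` to subprocess.run to execute.
print(shlex.join(command))
```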
This step is optional. When a motifs file in JASPAR format is supplied, the pipeline generates additional transcription factor analyses. If these analyses are of interest, you can download transcription factor motif position-weight matrices in JASPAR format, for example from JASPAR 2022. The JASPAR format specifies a motif name using a FASTA-style header line followed by the position-weight matrix. Here is an example:
>Arnt_MA0004.1
A [ 4 19 0 0 0 0 ]
C [ 16 0 20 0 0 0 ]
G [ 0 1 0 20 0 20 ]
T [ 0 0 0 0 20 0 ]
The pre-built GRCh38 and mm10 references utilize JASPAR vertebrate, non-redundant motifs that can be downloaded from JASPAR 2022.
Note: the motif headers are modified such that >MOTIF_ID\tMOTIF_NAME is turned into >MOTIF_NAME_MOTIF_ID. This modification allows for better readability of the motif analysis results.
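The header modification described in the note can be reproduced on a downloaded motif file with a few lines of Python. This is a sketch, not the script used to build the pre-built references. It assumes each header holds a motif ID and a motif name separated by whitespace (a tab in the note above), and it joins any whitespace inside the name with underscores. The function names are hypothetical.

```python
def rewrite_jaspar_header(line: str) -> str:
    """Turn '>MOTIF_ID<tab>MOTIF_NAME' into '>MOTIF_NAME_MOTIF_ID'."""
    if not line.startswith(">"):
        return line  # matrix rows pass through unchanged
    parts = line[1:].strip().split(None, 1)
    if len(parts) != 2:
        return line  # leave headers without both an ID and a name untouched
    motif_id, motif_name = parts
    motif_name = "_".join(motif_name.split())
    return f">{motif_name}_{motif_id}"


def rewrite_jaspar_text(text: str) -> str:
    """Apply the header rewrite to every line of a JASPAR-format file."""
    return "\n".join(rewrite_jaspar_header(l) for l in text.splitlines()) + "\n"


example = ">MA0004.1\tArnt\nA [ 4 19 0 0 0 0 ]\nC [ 16 0 20 0 0 0 ]\n"
print(rewrite_jaspar_text(example), end="")
```

Headers that are already in the rewritten form, such as >Arnt_MA0004.1, contain no whitespace and are left as they are.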
cellranger-arc mkref takes as input a configuration file that bundles various inputs to the tool. We explain how to construct a configuration file using the example of GRCh38:
{
    organism: "human"
    genome: ["GRCh38"]
    input_fasta: ["/path/to/GRCh38/assembly.fa"]
    input_gtf: ["/path/to/gencode/annotation.gtf"]
    non_nuclear_contigs: ["chrM"]
    input_motifs: "/path/to/jaspar/motifs.pfm"
}
Each line consists of a key: value pair. Some fields are plain strings enclosed in double quotes "", others are filesystem paths that are also enclosed in double quotes "", and some parameters are lists of strings or paths enclosed in square brackets []. The individual parameter fields are described below:
Parameter | Function |
---|---|
organism | Optional; string. Name of the organism. This is displayed in the web summary but is otherwise not used in the analysis. |
genome | Required; list of strings. Name(s) of the genome(s) that comprise the organism. Note: Cell Ranger ARC only supports single-species references so this list should be of length 1. The reference package is constructed in the current working directory where the directory name is the name of the genome. In the example above the reference package would be constructed in $(pwd)/GRCh38. |
input_fasta | Required; list of paths. Path(s) to the assembly FASTA file(s) for each genome in uncompressed FASTA format. Note: Cell Ranger ARC only supports single-species references so this list should be of length 1. |
input_gtf | Required; list of paths. Path(s) to the gene annotation GTF file(s) for each genome in GTF format. Note: Cell Ranger ARC only supports single-species references so this list should be of length 1. |
non_nuclear_contigs | Optional; list of strings. Name(s) of contig(s) that do not have any chromatin structure, for example, mitochondria or plastids. For the GRCh38 assembly this would be ["chrM"]. These contigs are excluded from peak calling since the entire contig will be "open" due to a lack of chromatin structure. |
input_motifs | Optional; path. Path to file containing transcription factor motifs in JASPAR format (see above). Note: any spaces in the header name are converted to a single underscore. For ease of use in Loupe, we recommend using a header that begins with a human-readable name rather than a motif identifier. |
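A configuration file in the layout of the GRCh38 example can be generated from a script. The sketch below mirrors that example line for line. It takes a single genome name, FASTA path, and GTF path and wraps each in a one-element list, which matches the single-species restriction in the table. Leaving out keys marked Optional when no value is given is an assumption based on that marking. The function name render_mkref_config is hypothetical.

```python
import json
from typing import Optional, Sequence


def render_mkref_config(
    genome: str,
    input_fasta: str,
    input_gtf: str,
    organism: Optional[str] = None,
    non_nuclear_contigs: Optional[Sequence[str]] = None,
    input_motifs: Optional[str] = None,
) -> str:
    """Render a config in the key: value layout of the GRCh38 example."""
    entries = []
    if organism is not None:
        entries.append(("organism", organism))
    # Required keys are lists of length 1 (single-species references only).
    entries.append(("genome", [genome]))
    entries.append(("input_fasta", [input_fasta]))
    entries.append(("input_gtf", [input_gtf]))
    if non_nuclear_contigs:
        entries.append(("non_nuclear_contigs", list(non_nuclear_contigs)))
    if input_motifs is not None:
        entries.append(("input_motifs", input_motifs))
    # json.dumps supplies the double quotes and square brackets.
    body = "\n".join(f"    {key}: {json.dumps(value)}" for key, value in entries)
    return "{\n" + body + "\n}\n"


config_text = render_mkref_config(
    genome="GRCh38",
    input_fasta="/path/to/GRCh38/assembly.fa",
    input_gtf="/path/to/gencode/annotation.gtf",
    organism="human",
    non_nuclear_contigs=["chrM"],
    input_motifs="/path/to/jaspar/motifs.pfm",
)
print(config_text, end="")
```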
To create the reference package, use the cellranger-arc mkref command, passing it a configuration file that points to a matching set of FASTA and GTF files. This utility copies your FASTA and GTF, indexes these in several formats, and outputs a folder with the name you pass to genome in the config file. Input GTF files are typically filtered with mkgtf prior to mkref.
Argument | Description |
---|---|
--config | Required. Path to a configuration file containing additional information about the reference. See above for more details. |
--memgb | Optional. Maximum memory (GB) used during STAR genome index generation. Defaults to 16. Please note, the amount of memory specified must be greater than the number of gigabases in the input reference FASTA file. |
--ref-version | Optional. Reference version string to include with reference. |
--nthreads | Optional. Number of threads used during STAR genome index generation. Defaults to 1. |
--help or -h | Optional. Show list of all arguments and options. |
--version | Optional. Show version. |
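The --memgb constraint (the value must exceed the number of gigabases in the input FASTA) can be checked before starting a long mkref run. The sketch below counts sequence characters in an uncompressed FASTA and compares the total with a proposed --memgb value. It treats one gigabase as 1e9 bases, and the function names are hypothetical.

```python
from typing import Iterable


def count_bases(fasta_lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA lines, ignoring '>' header lines."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue
        total += len(line.strip())
    return total


def memgb_is_sufficient(memgb: float, n_bases: int) -> bool:
    """True if memgb is greater than the genome size in gigabases."""
    return memgb > n_bases / 1e9


# Hypothetical usage on a real file:
# with open("/path/to/GRCh38/assembly.fa") as handle:
#     n = count_bases(handle)
# print(n / 1e9, memgb_is_sufficient(16, n))
```

The check covers only the rule stated in the table; actual memory needs during indexing may be higher.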
To build a reference, run mkref as illustrated below:
$ cellranger-arc mkref --config=/home/jdoe/10x_references/GRCh38.config
>>> Creating reference for GRCh38 <<<
Creating new reference folder at /home/jdoe/10x_references/GRCh38
...done
Writing genome FASTA file into reference folder...
...done
Indexing genome FASTA file...
...done
Writing genes GTF file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
MM DD hh:mm:ss ..... started STAR run
MM DD hh:mm:ss ... starting to generate Genome files
MM DD hh:mm:ss ... starting to sort Suffix Array. This may take a long time...
MM DD hh:mm:ss ... sorting Suffix Array chunks and saving them to disk...
MM DD hh:mm:ss ... loading chunks from disk, packing SA...
MM DD hh:mm:ss ... finished generating suffix array
MM DD hh:mm:ss ... generating Suffix Array index
MM DD hh:mm:ss ... completed Suffix Array index
MM DD hh:mm:ss ..... processing annotations GTF
MM DD hh:mm:ss ..... inserting junctions into the genome indices
MM DD hh:mm:ss ... writing Genome to disk ...
MM DD hh:mm:ss ... writing Suffix Array to disk ...
MM DD hh:mm:ss ... writing SAindex to disk
MM DD hh:mm:ss ..... finished successfully
...done.
Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done
Computing hash of genes GTF file...
...done
...done
Generating bwa index (may take over an hour for a 3Gb genome)...
[bwa_index] Pack FASTA... 0.13 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 4.05 seconds elapse.
[bwa_index] Update BWT... 0.06 sec
[bwa_index] Pack forward-only FASTA... 0.09 sec
[bwa_index] Construct SA from BWT and Occ... 1.80 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /home/jdoe/10x_references/GRCh38/fasta/genome.fa
[main] Real time: 6.206 sec; CPU: 6.137 sec
done
Writing TSS and transcripts bed file...
...done
Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done
Computing hash of genes GTF file...
...done
...done
>>> Reference successfully created at GRCh38 <<<
The created reference package will contain the following files:
GRCh38
├── fasta
│   ├── genome.fa
│   ├── genome.fa.amb
│   ├── genome.fa.ann
│   ├── genome.fa.bwt
│   ├── genome.fa.fai
│   ├── genome.fa.pac
│   └── genome.fa.sa
├── genes
│   └── genes.gtf.gz
├── reference.json
├── regions
│   ├── motifs.pfm # present if motifs file was supplied
│   ├── transcripts.bed
│   └── tss.bed
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab
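A quick way to confirm that a mkref run completed is to check the output folder against the listing above. The sketch below checks a representative subset of the listed files, not every file, and reports any that are missing. The list contents come from the listing; the names EXPECTED_FILES and missing_reference_files are hypothetical.

```python
from pathlib import Path
from typing import List, Union

# A representative subset of the files shown in the listing above.
EXPECTED_FILES = [
    "fasta/genome.fa",
    "fasta/genome.fa.fai",
    "fasta/genome.fa.bwt",
    "genes/genes.gtf.gz",
    "reference.json",
    "regions/transcripts.bed",
    "regions/tss.bed",
    "star/Genome",
    "star/SA",
    "star/SAindex",
]


def missing_reference_files(
    ref_dir: Union[str, Path], expect_motifs: bool = False
) -> List[str]:
    """Return the expected relative paths that are absent under ref_dir."""
    expected = list(EXPECTED_FILES)
    if expect_motifs:
        # motifs.pfm is only present if a motifs file was supplied.
        expected.append("regions/motifs.pfm")
    root = Path(ref_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Hypothetical usage:
# print(missing_reference_files("/home/jdoe/10x_references/GRCh38", expect_motifs=True))
```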
Indexing a typical human 3 Gb FASTA file can take up to 8 core hours and requires 32 GB of memory. Note that STAR reference generation can only be run on one thread due to technical reasons. We expect to remedy this in a subsequent release.
The only difference between a reference constructed using cellranger-atac mkref and one constructed using cellranger-arc mkref is that Cell Ranger ARC references contain a genome index for the splice-aware STAR aligner, which is used to compute alignments of gene expression reads. An ARC reference can be used with Cell Ranger ATAC, but the reverse is not true: an ATAC reference cannot be used with Cell Ranger ARC.
Provided that you follow the format described above, it is fairly simple to add custom gene definitions to an existing reference package constructed using cellranger-arc mkref. If we assume that the reference package is located in REF_DIR, the FASTA sequence records are stored in REF_DIR/fasta/genome.fa and the gene annotations are compressed and stored in REF_DIR/genes/genes.gtf.gz. First, create a new FASTA reference file by adding any additional contigs for the new genes to REF_DIR/fasta/genome.fa if needed. Next, create a new GTF file by uncompressing REF_DIR/genes/genes.gtf.gz and then appending records for each new gene. Note that the new genes must have GTF features of type 'exon' for each exon and 'transcript' for each transcript.
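The GTF part of this procedure can be sketched in Python: decompress the packaged annotations into a new file and append the custom records. The function name extend_gtf and the paths in the usage comment are hypothetical. The sketch writes to a new file and leaves the existing reference package unchanged.

```python
import gzip
from pathlib import Path
from typing import Iterable, Union


def extend_gtf(
    gz_gtf: Union[str, Path],
    new_records: Iterable[str],
    out_gtf: Union[str, Path],
) -> int:
    """Write the decompressed GTF plus new_records to out_gtf.

    Returns the number of appended records.
    """
    appended = 0
    with gzip.open(gz_gtf, "rt") as src, open(out_gtf, "w") as dst:
        last = "\n"
        for line in src:
            dst.write(line)
            last = line
        if not last.endswith("\n"):
            dst.write("\n")  # make sure new records start on their own line
        for record in new_records:
            dst.write(record.rstrip("\n") + "\n")
            appended += 1
    return appended


# Hypothetical usage, with REF_DIR as in the text:
# extend_gtf("REF_DIR/genes/genes.gtf.gz", my_records, "genes.custom.gtf")
```

Additional contigs can be appended to a copy of REF_DIR/fasta/genome.fa in the same way, since FASTA is also a line-oriented text format.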
The GTF file format is essentially a list of records, one per line, each comprising nine tab-delimited non-empty fields.
Column | Name | Description |
---|---|---|
1 | Chromosome | Must refer to a chromosome/contig in the genome FASTA. |
2 | Source | Unused. |
3 | Feature | cellranger-arc count requires the presence of an 'exon' row for each exon and a 'transcript' row for each transcript. |
4 | Start | Start position on the reference (1-based inclusive). |
5 | End | End position on the reference (1-based inclusive). |
6 | Score | Unused. |
7 | Strand | Strandedness of this feature on the reference: + or -. |
8 | Frame | Unused. |
9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value". The attribute keys transcript_id and gene_id are required. |
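New records can be checked against the column rules in this table before rebuilding the reference. The sketch below is not the validation that cellranger-arc itself performs. It checks the field count, the contig name, the coordinates, the strand, and the required attribute keys. Applying the gene_id and transcript_id check only to exon and transcript rows is an assumption made here, and the function name is hypothetical.

```python
import re
from typing import List, Set

# Matches attribute pairs of the form: key "value"
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def gtf_record_problems(line: str, contigs: Set[str]) -> List[str]:
    """Return a list of rule violations for one GTF record (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-delimited fields, found {len(fields)}"]
    problems = []
    if any(field == "" for field in fields):
        problems.append("empty field")
    chrom, _source, feature, start, end, _score, strand, _frame, attrs = fields
    if chrom not in contigs:
        problems.append(f"contig {chrom!r} is not in the genome FASTA")
    # Coordinates are 1-based and inclusive, so start >= 1 and end >= start.
    if not (start.isdecimal() and end.isdecimal()) or int(start) < 1 or int(end) < int(start):
        problems.append("start/end must be integers with 1 <= start <= end")
    if strand not in ("+", "-"):
        problems.append("strand must be + or -")
    if feature in ("exon", "transcript"):
        keys = {key for key, _value in ATTRIBUTE_RE.findall(attrs)}
        for required in ("gene_id", "transcript_id"):
            if required not in keys:
                problems.append(f"missing attribute {required}")
    return problems


record = 'chrNew\tcustom\texon\t1\t500\t.\t+\t.\tgene_id "MYGENE"; transcript_id "MYGENE.1";'
print(gtf_record_problems(record, {"chrNew"}))
```

The contig set can be collected from the header lines of the FASTA file that will be passed to mkref.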
After adding the necessary records to your FASTA file and the additional lines to your GTF file, run cellranger-arc mkref as described above.