
10x Genomics
Chromium Single Cell ATAC

Cell Ranger ATAC Genome References

Overview

The reference data for Cell Ranger ATAC pipelines consists of the reference genome sequence and its associated genome annotation, which includes gene and transcript coordinates. The genome sequences and annotations can be obtained from reputable, well-established consortia such as NCBI, GENCODE, Ensembl and ENCODE. We provide pre-built single and mixed species references described in the next section, as well as a command-line tool mkref to build references that are not pre-built.

Pre-built Standard References

We provide the following pre-built references on the downloads page.

Standard single species reference packages:

Standard multi-species reference packages:

These are made by taking the union of reference sequences and annotations from individual single species pre-built references.

Arguments and Options

cellranger-atac 1.2.0 supports building single species references using mkref.

Parameter | Function
GENOME | (Required) Name of the genome reference. The new reference will be built as a new directory named GENOME under the current working directory.
--config | (Optional for standard references) Configuration file used to build a custom reference. Ignored when GENOME is one of the standard references: hg19, b37, GRCh38 or mm10.

Building with mkref

To build a custom reference, a configuration file specifying the source for genome sequences and annotations as well as contigs present in the genome is required (more on this in the configuration file requirements). The following is an example config file fly_BDGP6.config for building a reference for Drosophila melanogaster.

{
	GENOME_FASTA_INPUT: "ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.25_FB2018_06/fasta/dmel-all-chromosome-r6.25.fasta.gz",
	GENE_ANNOTATION_INPUT: "ftp://ftp.ensembl.org/pub/release-95/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.95.gtf.gz",
	MOTIF_INPUT: "http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_insects_non-redundant_pfms_jaspar.txt",
	ORGANISM: "Drosophila melanogaster",
	PRIMARY_CONTIGS: ["2L", "2R", "3L", "3R", "4", "X", "Y"],
	NON_NUCLEAR_CONTIGS: ["mitochondrion_genome"]
}
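
Because the configuration file section states that strictly formatted JSON is also accepted, a config like the one above could be generated programmatically. The following is a minimal sketch, not part of Cell Ranger ATAC: the helper name config_to_json is hypothetical, and the values are copied from the fly_BDGP6.config example.

```python
import json

# Values copied from the fly_BDGP6.config example in this section.
config = {
    "GENOME_FASTA_INPUT": "ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.25_FB2018_06/fasta/dmel-all-chromosome-r6.25.fasta.gz",
    "GENE_ANNOTATION_INPUT": "ftp://ftp.ensembl.org/pub/release-95/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.95.gtf.gz",
    "MOTIF_INPUT": "http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_insects_non-redundant_pfms_jaspar.txt",
    "ORGANISM": "Drosophila melanogaster",
    "PRIMARY_CONTIGS": ["2L", "2R", "3L", "3R", "4", "X", "Y"],
    "NON_NUCLEAR_CONTIGS": ["mitochondrion_genome"],
}


def config_to_json(cfg):
    """Hypothetical helper: serialize cfg as strict JSON (all keys quoted)."""
    return json.dumps(cfg, indent=4) + "\n"


if __name__ == "__main__":
    # Print the strict-JSON text; redirect it to a file such as fly_BDGP6.config.
    print(config_to_json(config), end="")
```

The printed text differs from the example above only in that the keys are quoted.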

To build the reference, run mkref:

$ cd /home/jdoe/ref
$ cellranger-atac mkref fly_BDGP6 --config fly_BDGP6.config 
 
Non-standard genome name detected, building custom reference...
 
>>> Creating reference for fly_BDGP6 <<<
 
Creating new reference folder at /home/jdoe/ref/fly_BDGP6
Downloading fasta files from source...
done
 
Generating samtools index...
done
 
Generating pyfasta indexes...
    Number of contigs: 1870
    Total genome size: 143726002
done
 
Downloading gene annotation files from source...
done
 
Writing TSS and transcripts bed file...
    Parsed 23541 unique TSS and 28827 unique transcripts.
done
 
Generating bwa index (may take over an hour for a 3Gb genome)...
[bwa_index] Pack FASTA... 1.23 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=287452004, availableWord=32225820
[BWTIncConstructFromPacked] 10 iterations done. 53158068 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 98205524 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 138239796 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 173818340 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 205436596 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 233534948 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 258504820 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 280694052 characters processed.
[bwt_gen] Finished constructing BWT in 84 iterations.
[bwa_index] 94.16 seconds elapse.
[bwa_index] Update BWT... 0.93 sec
[bwa_index] Pack forward-only FASTA... 0.74 sec
[bwa_index] Construct SA from BWT and Occ... 33.75 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /home/jdoe/ref/fly_BDGP6/fasta/genome.fa
[main] Real time: 131.225 sec; CPU: 130.816 sec
done
 
Downloading pfm files from source...
done
 
Finishing up...
>>> Reference successfully created! <<<
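
Since indexing can take a long time, it may be worth confirming before running mkref that every contig named in the config is present in a locally downloaded FASTA file. The sketch below is an illustration only, not what mkref does internally. It assumes that a contig name is the first whitespace-delimited token of each ">" header line, and the function names are hypothetical.

```python
import gzip


def fasta_contig_names(path):
    """Return the set of contig names found in a FASTA file.

    Assumption: the name is the first whitespace-delimited token after '>'.
    Files ending in .gz are read through gzip.
    """
    opener = gzip.open if path.endswith(".gz") else open
    names = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def missing_contigs(fasta_path, primary, non_nuclear):
    """List configured contigs that do not appear in the FASTA file."""
    present = fasta_contig_names(fasta_path)
    return [c for c in list(primary) + list(non_nuclear) if c not in present]
```

An empty list from missing_contigs means every name in PRIMARY_CONTIGS and NON_NUCLEAR_CONTIGS was found under this naming assumption.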

System Requirements

Indexing is the computational bottleneck in building references for Cell Ranger ATAC. Indexing a typical 3 Gb human FASTA file can take up to 8 core hours and requires 32 GB of memory.

Configuration file

To build a custom reference, you must supply a configuration file like the Drosophila example shown in the Building with mkref section. The example file is written in a "human readable" JSON format, though strictly formatted JSON is also accepted. The table below lists the required input keys. Each key takes a value that must satisfy the type requirement given in the second column. Some values must also meet format requirements, for example when the value is a URL or file path pointing to a file.

Required Input Keys | Type Requirements | Format Requirements
GENOME_FASTA_INPUT | valid URL or file path
  • Must be in valid fasta format
  • Must contain all contigs listed in PRIMARY_CONTIGS and NON_NUCLEAR_CONTIGS
GENE_ANNOTATION_INPUT | valid URL or file path
  • Must be in valid GTF or GFF3 format.
  • Must contain all contigs listed in PRIMARY_CONTIGS and NON_NUCLEAR_CONTIGS.
  • All entries must match the contig lengths defined in GENOME_FASTA_INPUT.
  • Must contain "transcript" or "mRNA" in the third column.
  • For each row with "transcript" or "mRNA" in the third column, must have "gene_name" defined in the attribute column for GTF format, or "Name" for GFF3 format. This is to denote the common name of the gene. Alternatives such as "gene_symbol" will not be accepted.
  • Header lines starting with "#" are allowed in GTF or GFF3 input. However, comment lines starting with "#" in between GTF or GFF3 record rows will not be properly handled.
  • (Optional) May contain a "gene_type" field. It is used for filtering during peak annotation, so that only "protein_coding" genes and genes coding for VDJ segments are annotated in pre-built genomes. When it is not present, no filtering is applied.
MOTIF_INPUT | valid URL or file path; use "" to indicate that it is not available
  • Must be in valid JASPAR format. For example:
    • JASPAR 2010 matrix_only format:

       >MA0001.1 AGL3
       A  [ 0  3 79 40 66 48 65 11 65  0 ]
       C  [94 75  4  3  1  2  5  2  3  3 ]
       G  [ 1  0  3  4  1  0  5  3 28 88 ]
       T  [ 2 19 11 50 29 47 22 81  1  6 ]
      
    • JASPAR 2010-2014 PFMs format:

       >MA0001.1 AGL3
       0       3       79      40      66      48      65      11      65      0
       94      75      4       3       1       2       5       2       3       3
       1       0       3       4       1       0       5       3       28      88
       2       19      11      50      29      47      22      81      1       6
      
    • The expected naming scheme of the motif is "motif ID" and "gene name" separated by a tab.
PRIMARY_CONTIGS | list | Must be enclosed in brackets `[]`, with each contig name in double quotes "". Note that PRIMARY_CONTIGS cannot be an empty list.
NON_NUCLEAR_CONTIGS | list | Must be enclosed in brackets `[]`, with each contig name in double quotes "". Use empty brackets `[]` to specify an empty list.
ORGANISM | string | Can be left empty as "". If provided, it is displayed in the summary HTML file.
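
The type requirements in the table above can be checked on a config that has already been loaded into a Python dict (for instance from a strict-JSON file via json.load). This is a minimal sketch under that assumption. The function name and the message texts are hypothetical, and it does not download or inspect the files the values point to.

```python
REQUIRED_KEYS = (
    "GENOME_FASTA_INPUT",
    "GENE_ANNOTATION_INPUT",
    "MOTIF_INPUT",
    "PRIMARY_CONTIGS",
    "NON_NUCLEAR_CONTIGS",
    "ORGANISM",
)


def config_problems(cfg):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]

    # Input locations must be non-empty strings (a URL or a file path).
    for key in ("GENOME_FASTA_INPUT", "GENE_ANNOTATION_INPUT"):
        if key in cfg and not (isinstance(cfg[key], str) and cfg[key]):
            problems.append(f"{key} must be a non-empty URL or file path")

    # These may be "" (motifs not available, organism left empty).
    for key in ("MOTIF_INPUT", "ORGANISM"):
        if key in cfg and not isinstance(cfg[key], str):
            problems.append(f"{key} must be a string")

    # Contig lists must be lists of strings.
    for key in ("PRIMARY_CONTIGS", "NON_NUCLEAR_CONTIGS"):
        if key in cfg:
            value = cfg[key]
            if not (isinstance(value, list) and all(isinstance(c, str) for c in value)):
                problems.append(f"{key} must be a list of quoted contig names")

    # Only NON_NUCLEAR_CONTIGS is allowed to be empty.
    if cfg.get("PRIMARY_CONTIGS") == []:
        problems.append("PRIMARY_CONTIGS cannot be an empty list")
    return problems
```

Returning a list of messages instead of raising on the first failure lets all problems in a config be reported in one pass.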

 

Advanced: file structure in a reference

A single species reference compatible with the Cell Ranger ATAC pipelines has the following file structure:

$ tree /home/jdoe/ref
/home/jdoe/ref
├── fasta
│   ├── contig-defs.json    [required, input]
│   ├── genome.fa           [required, input, for pre-built references, sources: NCBI]
│   ├── genome.fa.amb       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.ann       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.bwt       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.fai       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.flat      [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.gdx       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.pac       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   └── genome.fa.sa        [required, derived from genome.fa using samtools faidx, bwa, pysam]
├── genes
│   ├── genes.gtf           [required, input, GENCODE sources for pre-built references: hg19, b37, GRCh38 and mm10]
│   └── regulatory.gff      [pre-built references only, Ensembl sources: hg19, b37, GRCh38 and mm10]
├── genome                  [required, input]
├── metadata.json           [required, input]
└── regions
    ├── blacklist.bed       [pre-built references only, ENCODE sources: hg19, b37, GRCh38, mm10]
    ├── ctcf.bed            [pre-built references only]
    ├── dnase.bed           [pre-built references only, ENCODE sources: hg19, b37, mm10, Anshul Kundaje's pipeline: GRCh38]
    ├── enhancer.bed        [pre-built references only, source: Ensembl regulatory build release 95]
    ├── promoter.bed        [pre-built references only, source: Ensembl regulatory build release 95]
    ├── motifs.pfm          [optional, input, source for pre-built references: JASPAR vertebrate non-redundant collection] 
    ├── transcripts.bed     [required for 1.1 and later references, derived from transcript coordinates in genes.gtf]
    └── tss.bed             [required, derived from first nt position of each transcript in genes.gtf]
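
The entries marked "required" in the listing above can be checked against a directory on disk. The following is a minimal sketch: the list is transcribed from the listing, including transcripts.bed, which is required only for 1.1 and later references. The function name is hypothetical, and the check only tests that each path exists, not that its contents are valid.

```python
from pathlib import Path

# Paths marked "required" in the listing, relative to the reference directory.
REQUIRED_FILES = [
    "fasta/contig-defs.json",
    "fasta/genome.fa",
    "fasta/genome.fa.amb",
    "fasta/genome.fa.ann",
    "fasta/genome.fa.bwt",
    "fasta/genome.fa.fai",
    "fasta/genome.fa.flat",
    "fasta/genome.fa.gdx",
    "fasta/genome.fa.pac",
    "fasta/genome.fa.sa",
    "genes/genes.gtf",
    "genome",
    "metadata.json",
    "regions/transcripts.bed",  # required for 1.1 and later references
    "regions/tss.bed",
]


def missing_reference_files(ref_dir, required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under ref_dir."""
    root = Path(ref_dir)
    return [rel for rel in required if not (root / rel).exists()]
```

For example, missing_reference_files("/home/jdoe/ref") returns an empty list when every required entry is present.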

The required files listed above are the minimal set needed to create a directory structure compatible with Cell Ranger ATAC pipelines. Some required files are specified as inputs in the config file described in the configuration file requirements section. Other required files are derived by processing a required input file. The regulatory and functional domain files, such as promoter.bed, are present only in the pre-built references. The transcripts.bed file is derived and is not present in 1.0 references, but the 1.2.0 pipelines are backwards compatible with 1.0 references. Note that mkref recognizes four keywords (hg19, b37, mm10, GRCh38); running cellranger-atac mkref with one of them as GENOME builds the corresponding pre-built reference.
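
As an illustration of how a derived file relates to its input, the sketch below extracts one TSS position per transcript row of a GTF file. It is not the mkref implementation. It assumes that the "first nt position" is the start coordinate on the + strand and the end coordinate on the - strand, that positions are converted from 1-based GTF to 0-based half-open BED intervals, and that the output column order is chosen here for illustration; the function name is hypothetical. Only the GTF "gene_name" attribute is handled, not the GFF3 "Name" form.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]*)"')


def tss_records(gtf_lines):
    """Return sorted unique (contig, start0, end0, gene_name, strand) tuples."""
    records = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("transcript", "mRNA"):
            continue
        match = GENE_NAME.search(cols[8])
        if match is None:
            raise ValueError(f"transcript row without gene_name: {line!r}")
        contig, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # Assumption: the TSS is the 5' end, so use `end` on the minus strand.
        pos = end if strand == "-" else start
        records.add((contig, pos - 1, pos, match.group(1), strand))
    return sorted(records)
```

Collecting the tuples in a set drops duplicate positions, which is one way a count of unique TSS could end up smaller than the count of transcripts.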