Cell Ranger DNA 1.0, printed on 11/19/2024
Cell Ranger DNA provides pre-built GRCh38 (human), GRCh37 (human) and GRCm38 (mouse) reference packages for use with the pipeline. These references come packaged with GENCODE annotations. Following current conventions, these references have the following properties:
At this time, only these human and mouse references have seen extensive testing.
Cell Ranger DNA comes packaged with the command mkref, which constructs a reference. mkref requires a single FASTA file containing the reference genome sequence and a contig_defs.json file. Any alternate haplotype sequence records should be omitted from the FASTA file, since they increase the fraction of the genome that is unmappable. An optional GTF file may also be provided; it is used solely to provide gene annotations for visualization in Loupe scDNA Browser:
$ cellranger-dna mkref <fasta_file> <contig_defs_file> [--gtf=<file.gtf.gz>]
After this process has completed, there should be a new folder called refdata-$GENOME (where $GENOME is the FASTA filename without the filetype suffix) in the current directory with the following structure:
$ tree refdata-$GENOME
├── fasta
│   ├── genome.fa
│   ├── genome.fa.amb
│   ├── genome.fa.ann
│   ├── genome.fa.bwt
│   ├── genome.fa.fai
│   ├── genome.fa.flat
│   ├── genome.fa.gdx
│   ├── genome.fa.pac
│   └── genome.fa.sa
│   └── contig-defs.json
├── genes
├── genome
├── regions
└── snps
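Because alternate haplotype records should be left out of the FASTA passed to mkref, it can help to filter the file beforehand. The following is a minimal sketch, not part of Cell Ranger DNA: the function name `drop_alt_records` is hypothetical, and the default test assumes UCSC-style GRCh38 contig names, where alternate haplotype contigs end in `_alt`. Other assemblies may need a different test.

```python
import io


def drop_alt_records(fasta_in, fasta_out, is_alt=lambda name: name.endswith("_alt")):
    """Copy FASTA records from fasta_in to fasta_out, skipping alternate haplotypes.

    The default is_alt test is an assumption based on UCSC-style GRCh38 names;
    pass a different callable for other naming schemes.
    Returns the names of the records that were kept, in file order.
    """
    kept = []
    keep = False
    for line in fasta_in:
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            keep = not is_alt(name)
            if keep:
                kept.append(name)
        if keep:
            fasta_out.write(line)
    return kept


# Tiny in-memory example with made-up sequence data.
example = io.StringIO(
    ">chr1 primary\nACGT\nACGT\n>chr1_KI270762v1_alt\nTTTT\n>chrM\nGGCC\n"
)
filtered = io.StringIO()
kept_names = drop_alt_records(example, filtered)
print(kept_names)
```

For real files, the same function can be called with two file handles opened in text mode, and the filtered FASTA then passed to mkref.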
A contig_defs.json file must be provided. This file determines which contigs in the input FASTA are considered primary contigs, and it keeps track of sex chromosomes and non-nuclear sequences.
The contig_defs.json file has the following keys:
species_prefixes: This field is not currently used and can be omitted. If this field is present, the contig names must match the species prefixes in the pattern ${prefix}_${contig}. As an example, if species_prefixes is ["GRCh38"], then primary_contigs would have the names ["GRCh38_chr1", ...].
primary_contigs: A list of primary contigs. Copy number variants will only be called on primary contigs. Primary contigs should be at least 10 megabases in length.
sex_chromosomes: A key-value list defining the expected copy number for sex chromosomes in the male and female case. This field is not currently used and may be omitted.
non_nuclear_contigs: A list of non-nuclear contigs such as the mitochondrial sequence. This field is not currently used and may be omitted.
As an example, here is the contig_defs.json that comes packaged with the GRCh38 reference:
{
  "species_prefixes": [""],
  "primary_contigs": [
    "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8",
    "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16",
    "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"
  ],
  "sex_chromosomes": {
    "_male": { "chrX": 1, "chrY": 1 },
    "_female": { "chrX": 2, "chrY": 0 }
  },
  "non_nuclear_contigs": ["chrM"]
}
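For a custom genome, a first draft of this file can be generated from a FASTA index. The sketch below is an illustration rather than a tool shipped with Cell Ranger DNA: `contig_defs_from_fai` is a hypothetical helper, it assumes a samtools-style `.fai` index (tab-separated, contig name in column 1 and length in column 2), and it assumes the mitochondrial contig is named `chrM`. It applies the 10 megabase guideline as a length cutoff and leaves out the optional keys. The contig lengths in the example are illustrative.

```python
import json

# "At least 10 megabases", taken from the primary_contigs description above.
MIN_PRIMARY_LENGTH = 10_000_000


def contig_defs_from_fai(fai_lines, non_nuclear=("chrM",), min_length=MIN_PRIMARY_LENGTH):
    """Draft a contig_defs dictionary from the lines of a .fai index.

    Assumes tab-separated lines with the contig name in column 1 and its
    length in column 2. Contigs named in non_nuclear are never primary.
    """
    primary = []
    found_non_nuclear = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if name in non_nuclear:
            found_non_nuclear.append(name)
        elif length >= min_length:
            primary.append(name)
    return {
        "primary_contigs": primary,
        "non_nuclear_contigs": found_non_nuclear,
    }


# Illustrative index lines; "small_scaffold" is a made-up contig name.
example_fai = [
    "chr1\t248956422\t6\t60\t61\n",
    "small_scaffold\t250000\t253105708\t60\t61\n",
    "chrX\t156040895\t253359882\t60\t61\n",
    "chrM\t16569\t412001466\t60\t61\n",
]
contig_defs = contig_defs_from_fai(example_fai)
print(json.dumps(contig_defs, indent=2))
```

A length cutoff is only a first pass. The resulting list is worth reviewing by hand, since the decision about which contigs are primary depends on the assembly.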
A GTF file may be provided to mkref to facilitate visualization in Loupe scDNA Browser. The file may be gzipped or uncompressed, and only annotations with a gene_type or gene_biotype attribute of protein_coding or pseudogene are considered.
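To preview which records in a GTF would qualify under this rule, a check along the following lines may help. It is a sketch under stated assumptions, not the logic mkref itself uses: the function names are hypothetical, the attribute value is assumed to match `protein_coding` or `pseudogene` exactly (whether subtypes such as `processed_pseudogene` also count is not stated here), and `gene_type` is given precedence when both attributes are present. The example gene identifiers are made up.

```python
import gzip
import re

KEPT_TYPES = {"protein_coding", "pseudogene"}
# GTF attributes look like: key "value"; key "value";
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def open_gtf(path):
    """Open a GTF in text mode, whether or not it is gzipped (judged by file suffix)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def annotation_is_considered(gtf_line):
    """Return True if the record's gene_type or gene_biotype is in KEPT_TYPES."""
    if gtf_line.startswith("#"):
        return False  # comment or header line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return False  # not a well-formed GTF record
    attributes = dict(ATTRIBUTE_RE.findall(fields[8]))
    gene_type = attributes.get("gene_type", attributes.get("gene_biotype"))
    return gene_type in KEPT_TYPES


# Made-up records for illustration.
coding = 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_type "protein_coding";'
lncrna = 'chr1\tsrc\tgene\t2000\t2500\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "lncRNA";'
pseudo = 'chr2\tsrc\tgene\t50\t400\t.\t+\t.\tgene_id "GENE_C"; gene_biotype "pseudogene";'
results = [annotation_is_considered(line) for line in (coding, lncrna, pseudo)]
print(results)
```

For a file on disk, each line read through `open_gtf` can be passed to `annotation_is_considered` in the same way.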