# Using Custom References

Cell Ranger provides pre-built hg19, mm10, and ercc92 reference packages for use with the pipeline. If you would like to use your own genome FASTA or gene GTF annotations, Cell Ranger supports the creation and use of custom references.

## Compatible Use Cases

Cell Ranger supports the use of customer-generated references under the following conditions:

• Your reference should have only a small number of overlapping gene annotations. Reads that align non-uniquely to multiple genes cause the pipeline to detect fewer molecules.
• Your FASTA and GTF files must be compatible with the open source RNA-seq aligner, STAR.

## Making a Reference Package

To create a custom reference, start with a GTF file that has been filtered to contain only the genes of interest. In many cases, cellranger mkgtf can do that filtering for you. The next step is to use cellranger mkref to index the FASTA and GTF files and create a reference package that is compatible with Cell Ranger.

1. Introduction to our tools, mkgtf and mkref.
2. Example: Generating our hg19 reference package
3. Special instructions for single-nuclei RNA

## Introduction to our Tools

### Using mkgtf

GTF files downloaded from sites like ENSEMBL and UCSC often contain transcripts and genes which need to be filtered from your final annotation. Usually, it is helpful to filter genes based on their key-value pairs in the GTF attribute column. Cell Ranger provides mkgtf, a simple utility to handle just such filtering. The format for this command is:

$ cellranger mkgtf input.gtf output.gtf --attribute=key:allowable_value

For example, to filter for only protein-coding genes, run the following command.

$ cellranger mkgtf hg19-ensembl.gtf hg19-filtered-ensembl.gtf --attribute=gene_biotype:protein_coding


This will generate a filtered GTF file hg19-filtered-ensembl.gtf from the original unfiltered GTF file hg19-ensembl.gtf.
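Before choosing --attribute values, it can help to see which gene_biotype values an annotation actually contains. The following is a minimal sketch using standard awk, not a Cell Ranger feature. The file toy.gtf and its records are made up for illustration, and the parsing assumes attribute values do not themselves contain semicolons or double quotes.

```shell
# Toy GTF (hypothetical records) so the example runs without any downloads.
workdir=$(mktemp -d)
printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n' >  "$workdir/toy.gtf"
printf '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";\n' >> "$workdir/toy.gtf"
printf '1\ttoy\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_biotype "lincRNA";\n' >> "$workdir/toy.gtf"
printf '2\ttoy\tgene\t50\t400\t.\t+\t.\tgene_id "G3"; gene_biotype "protein_coding";\n' >> "$workdir/toy.gtf"

# Count distinct genes per gene_biotype value, most frequent first.
biotype_counts=$(awk -F'\t' '
  /^#/ { next }                                  # skip comment lines
  {
    gene = ""; biotype = ""
    n = split($9, attrs, ";")                    # attributes are semicolon-delimited
    for (i = 1; i <= n; i++) {
      split(attrs[i], kv, "\"")                  # kv[2] is the quoted value
      if (attrs[i] ~ /^ *gene_id /)      gene = kv[2]
      if (attrs[i] ~ /^ *gene_biotype /) biotype = kv[2]
    }
    key = gene SUBSEP biotype
    if (gene != "" && biotype != "" && !(key in seen)) {
      seen[key] = 1                              # count each gene only once
      count[biotype]++
    }
  }
  END { for (b in count) print count[b], b }
' "$workdir/toy.gtf" | sort -k1,1nr -k2,2)

printf '%s\n' "$biotype_counts"
```

On the toy file this reports two protein_coding genes and one lincRNA gene. Each value listed this way is a candidate for an --attribute=gene_biotype:value filter.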

### Using mkref

To create a reference, use the cellranger mkref command, passing it one or more matching sets of FASTA and GTF files. The input files must meet the requirements above. This utility copies your FASTA and GTF, indexes these in several formats, and outputs a folder with the name you pass to --genome. Note that the genome name is an identifier, not a path, and should contain only alphanumeric characters and optionally period, hyphen, and underscore characters (. - _).
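The naming rule for --genome can be checked before starting a long indexing run. This is a small sketch of such a check in plain shell; is_valid_genome_name is a hypothetical helper written for this example, and Cell Ranger's own validation may differ in detail.

```shell
# Return 0 if the name is non-empty and uses only letters, digits,
# period, hyphen, and underscore; return 1 otherwise.
is_valid_genome_name() {
  case "$1" in
    ''|*[!A-Za-z0-9._-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Try a few candidate names; the last two contain a space and a slash.
for name in hg19 hg19_and_mm10 GRCh38-1.2.0_premrna 'my genome' refs/hg19; do
  if is_valid_genome_name "$name"; then
    echo "ok:  $name"
  else
    echo "bad: $name"
  fi
done
```

A name such as refs/hg19 fails the check because the genome name is an identifier rather than a path.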

...
$ ls hg19
fasta/ genes/ pickle/ reference.json star/

#### Multiple Species

To create a reference for multiple species, run the mkref command with multiple FASTA and GTF files. This is similar to the single species case above, but note that the order of the arguments matters. The arguments are grouped by the order they appear; for instance, the first --genome option listed corresponds to the first --fasta and --genes options listed.

$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
--genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
...
$ ls hg19_and_mm10
fasta/ genes/ pickle/ reference.json star/

#### System Requirements

Indexing a typical human 3 Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory. We recommend you run the mkref command with --nthreads equal to the number of cores available on your system. You can also specify the amount of memory cellranger should use during alignment via STAR. Use --memgb to specify the amount of memory to use, in GB; the default is 16 GB. Please note the amount of memory your reference uses during alignment must be greater than the number of gigabases in the input FASTA file.

## Generating the Cell Ranger reference package

The references in the Cell Ranger reference package were generated with these tools. When creating the Cell Ranger hg19 reference, the GTF file downloaded from ENSEMBL was filtered using the following cellranger mkgtf command.

$ cellranger mkgtf hg19-ensembl.gtf hg19-filtered-ensembl.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lincRNA \
--attribute=gene_biotype:antisense


Additionally, "chr" was prepended to the chromosome entries in the GTF.
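The exact command used for this step is not given here. One way to prepend "chr" to the first column of a GTF is sketched below with awk; the file names and records are made up for illustration. The sketch only adds the prefix, so contig names that differ between annotation sources in other ways would need separate handling.

```shell
# Toy GTF (hypothetical records) with unprefixed chromosome names.
workdir=$(mktemp -d)
printf '#!example header line\n' > "$workdir/in.gtf"
printf '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$workdir/in.gtf"
printf 'X\ttoy\texon\t500\t900\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n' >> "$workdir/in.gtf"

# Pass comment lines through unchanged; prefix column 1 of every record.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/  { print; next }
           { $1 = "chr" $1; print }' "$workdir/in.gtf" > "$workdir/out.gtf"

cat "$workdir/out.gtf"
```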

The hg19 FASTA was then downloaded from UCSC. Once alternate haplotype chromosomes were removed (any chromosome whose name contains hap, e.g. chr4_ctg9_hap1), running cellranger mkref as described above produced the Cell Ranger hg19 reference.
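The haplotype filtering can be done in several ways, and the tool actually used is not stated. As a sketch, the awk one-liner below drops every FASTA record whose header name contains hap. The toy sequences and file names are assumptions for illustration.

```shell
# Toy FASTA (hypothetical sequences) with one alternate-haplotype contig.
workdir=$(mktemp -d)
printf '>chr4\nACGTACGT\nACGT\n>chr4_ctg9_hap1\nGGGGCCCC\n>chr6\nTTTTAAAA\n' > "$workdir/in.fa"

# On each header line, decide whether to keep the record that follows;
# sequence lines are printed only while keep is true.
awk '/^>/ { keep = ($1 !~ /hap/) } keep' "$workdir/in.fa" > "$workdir/out.fa"

grep '^>' "$workdir/out.fa"
```

Listing the remaining headers, as in the last line, is a quick way to confirm that only the intended contigs were removed.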

$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf

The Cell Ranger mm10 reference was generated similarly using filtered ENSEMBL GTF and UCSC FASTA files.

## Generating a Cell Ranger compatible "pre-mRNA" reference package

The single-nuclei RNA-seq assay captures unspliced pre-mRNA as well as mature mRNA. However, after alignment, cellranger count only counts reads aligned to exons. Since the pre-mRNA will generate intronic reads, it may be useful to create a custom "pre-mRNA" reference package that lists each gene transcript locus as an exon. These intronic reads will then be included in the UMI counts for each gene and barcode.

A custom pre-mRNA reference package can be created from an existing Cell Ranger reference package in two steps. The following example starts with the pre-built GRCh38 reference package.

### 1. Create a "pre-mRNA" GTF

Extract the GTF annotation rows whose feature type (column 3 of the original tab-delimited GTF) is transcript, and replace the feature type transcript with exon. Here's a script to do this using the Linux utility awk.

$ awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{$3="exon"; print}' \
refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf > GRCh38-1.2.0.premrna.gtf
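To see what this transformation does, the same awk program can be run on a tiny hand-made GTF. The records below are invented for illustration; only the awk program itself comes from the step above.

```shell
# Toy GTF: one gene with one transcript made of two exons (hypothetical records).
workdir=$(mktemp -d)
printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n' > "$workdir/genes.gtf"
printf '1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$workdir/genes.gtf"
printf '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$workdir/genes.gtf"
printf '1\ttoy\texon\t700\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$workdir/genes.gtf"

# Keep only transcript rows and relabel them as exon rows.
awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{$3="exon"; print}' \
  "$workdir/genes.gtf" > "$workdir/premrna.gtf"

cat "$workdir/premrna.gtf"
```

The output contains a single exon row spanning positions 100 to 900, so the intron between the two original exons (301 to 699) now falls inside an annotated exon. The original gene and exon rows are not carried over.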


### 2. Run cellranger mkref

Use the unmodified genome.fa file and the new GTF file as inputs to cellranger mkref.

$ cellranger mkref --genome=GRCh38-1.2.0_premrna \
--fasta=refdata-cellranger-GRCh38-1.2.0/fasta/genome.fa \
--genes=GRCh38-1.2.0.premrna.gtf


Please note that when providing a custom pre-mRNA reference package (via --transcriptome) to cellranger count, we recommend setting the --chemistry option to the appropriate value. This bypasses the chemistry auto-detection step, which may require a large amount of memory because its memory usage scales with the size of the transcriptome.

Provided that you follow the format described above, it is fairly simple to add gene definitions to an existing reference. First, add the additional FASTA sequence records to the fasta/genome.fa file. Next, update the GTF file. The GTF file format is essentially a list of records, one per line, each comprising nine tab-delimited non-empty fields.

| Column | Name | Description |
|---|---|---|
| 1 | Chromosome | Must refer to a chromosome/contig in the genome FASTA. |
| 2 | Source | Unused. |
| 3 | Feature | cellranger count only uses rows where this field is exon. |
| 4 | Start | Start position on the reference (1-based inclusive). |
| 5 | End | End position on the reference (1-based inclusive). |
| 6 | Score | Unused. |
| 7 | Strand | Strandedness of this feature on the reference: + or -. |
| 8 | Frame | Unused. |
| 9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value". The attribute keys transcript_id and gene_id are required; gene_name is optional and may be non-unique, but if present will be preferentially displayed in reports. |
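Hand-edited GTF lines are easy to get wrong, so a quick format check can save a failed indexing run. The sketch below is an illustrative awk check against the column rules in the table, not a Cell Ranger tool; check_gtf is a hypothetical helper. It assumes the strand and attribute rules only need to hold for exon rows, since those are the rows cellranger count uses.

```shell
# Print a message for each problem found; exit status is 1 if any were found.
check_gtf() {
  awk -F'\t' '
    /^#/    { next }                              # skip comment lines
    NF != 9 { printf "line %d: expected 9 fields, found %d\n", NR, NF; bad++; next }
    {
      for (i = 1; i <= 9; i++)
        if ($i == "") { printf "line %d: field %d is empty\n", NR, i; bad++ }
    }
    $3 == "exon" && $7 != "+" && $7 != "-" {
      printf "line %d: strand must be + or -\n", NR; bad++
    }
    $3 == "exon" && ($9 !~ /gene_id "[^"]+"/ || $9 !~ /transcript_id "[^"]+"/) {
      printf "line %d: exon row needs gene_id and transcript_id\n", NR; bad++
    }
    END { exit (bad > 0) }
  ' "$1"
}

# Toy inputs (hypothetical records): one well-formed file, one with two bad lines.
workdir=$(mktemp -d)
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "MyGene";\n' > "$workdir/good.gtf"
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\n' > "$workdir/bad.gtf"
printf 'chr1\ttoy\texon\t300\t400\t.\t.\t.\tgene_id "G1";\n' >> "$workdir/bad.gtf"

check_gtf "$workdir/good.gtf" && echo "good.gtf: no problems found"
check_gtf "$workdir/bad.gtf"  || echo "bad.gtf: problems found"
```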

After adding the necessary records to your FASTA file and the additional lines to your GTF file, run cellranger mkref as normal.
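Because every GTF record must name a contig that exists in the FASTA, it may be worth comparing the two sets of names before running mkref. This is a sketch using standard Unix tools. The file contents are invented, and it assumes the contig name is the FASTA header text up to the first whitespace.

```shell
# Toy inputs (hypothetical): the GTF refers to chrNew, which the FASTA lacks.
workdir=$(mktemp -d)
printf '>chr1 first contig\nACGT\n>chr2\nTTTT\n' > "$workdir/genome.fa"
printf 'chr1\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > "$workdir/genes.gtf"
printf 'chrNew\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n' >> "$workdir/genes.gtf"

# Contig names used in GTF column 1.
grep -v '^#' "$workdir/genes.gtf" | cut -f1 | sort -u > "$workdir/gtf_contigs.txt"

# Contig names defined in the FASTA headers.
awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$workdir/genome.fa" | sort -u > "$workdir/fa_contigs.txt"

# Names present in the GTF but absent from the FASTA.
missing=$(comm -23 "$workdir/gtf_contigs.txt" "$workdir/fa_contigs.txt")

if [ -n "$missing" ]; then
  echo "GTF contigs missing from FASTA:"
  printf '%s\n' "$missing"
fi
```

An empty result means every contig named in the GTF has a matching FASTA record. Here the check reports chrNew, signalling that its sequence still needs to be added to the FASTA.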