
10x Genomics
Chromium Single Cell Gene Expression

Build a Custom Reference With cellranger mkref

10x Genomics provides pre-built references for the human and mouse genomes to use with Cell Ranger. Researchers can build custom references for additional species or add custom marker genes of interest, such as GFP, to a reference. This tutorial outlines the steps to build a custom reference using the cellranger mkref pipeline.

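In this tutorial, you will learn how to find the input files you need, filter the GTF, set up and run the cellranger mkref command, and add a marker gene to the FASTA and GTF.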

Find the Input Files You Need

This tutorial follows the same steps used to create the 10x Genomics pre-built references for human and mouse. These steps can be found on this page: Build Notes for Reference Packages.

First, locate the reference genome FASTA and GTF files for your species. If the species is available from the Ensembl database, we recommend using the files from there, because the GTF files from Ensembl contain optional tags that make filtering easy. If your species of interest is not available from Ensembl, GTF and FASTA files from other sources can also work. Note that a GTF file is required; GFF files are not supported. (See GFF/GTF File Format - Definition and supported options.)

This tutorial generates a custom reference for the zebrafish, Danio rerio.

The files needed are located here in Ensembl.

Navigate to the Gene annotation section of the Ensembl website and click on the Download GTF link. This takes you to an ftp site with a list of GTF files available. Select the file called Danio_rerio.GRCz11.99.chr.gtf.gz. This is the GTF annotation file for this species. All species in Ensembl have similar files available to download. For more information on the GTF files in Ensembl, click on the README file.

Right-click the link to copy the address, paste the URL into the command line, and download the file with the wget command:

wget ftp://ftp.ensembl.org/pub/release-99/gtf/danio_rerio/Danio_rerio.GRCz11.99.chr.gtf.gz

The file is approximately 20 MB and takes less than a minute to download, depending on your system.

Decompress the file with the gunzip command:

gunzip Danio_rerio.GRCz11.99.chr.gtf.gz

Next, navigate back to the Ensembl page for Danio rerio and click on Download FASTA to access the ftp site containing several types of FASTA files. Select dna to access the directory with genome files. Download the FASTA file that contains all the chromosomes of the genome together; it has primary_assembly in the filename. Right-click on the link to copy the address, paste the URL into the command line, and download the file with the wget command:

wget ftp://ftp.ensembl.org/pub/release-98/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz

The file is approximately 400 MB and takes several minutes to download, depending on your system.

Decompress the file with the gunzip command:

gunzip Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
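Before moving on, it can be worth confirming that the sequence names used in the GTF (column 1) also appear as headers in the FASTA, since an annotation on a sequence that is missing from the genome cannot be used. The following is a minimal sketch on two toy files; the file names toy.fa and toy.gtf and their contents are made up for illustration, and the same commands could be pointed at the downloaded files instead.

```shell
#!/bin/sh
# Toy stand-ins for the real genome FASTA and GTF (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>1 dna:chromosome\nACGT\n>MT dna:chromosome\nACGT\n' > toy.fa
printf '1\tensembl\texon\t1\t4\t.\t+\t.\tgene_id "g1";\nMT\tensembl\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' > toy.gtf

# Sequence names in the FASTA: text after ">" up to the first space.
grep '^>' toy.fa | sed 's/^>//; s/ .*//' | sort -u > fasta_names.txt
# Sequence names in the GTF: column 1 of every non-comment line.
grep -v '^#' toy.gtf | cut -f 1 | sort -u > gtf_names.txt

# Names used in the GTF but absent from the FASTA (ideally none).
missing=$(comm -13 fasta_names.txt gtf_names.txt)
echo "GTF sequence names missing from FASTA: ${missing:-none}"
```

If the check lists any names, the two files probably come from different assemblies or use different naming schemes (for example "chr1" versus "1").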

Filter the GTF

GTF files can contain entries for non-polyA transcripts that overlap with protein-coding gene models. Because of the overlapping annotations, reads in these regions can be flagged as mapped to multiple genes (multi-mapped), and multi-mapped reads are not counted. See the article Which reads are considered for UMI counting by Cell Ranger. To remove these entries from the GTF, add this filter argument to the mkgtf command: --attribute=gene_biotype:protein_coding. To see all of the filters used to build the references available on our support site, see the Build Notes for Reference Packages mentioned above. If you are using a GTF file that does not contain gene_biotype attributes or is missing other entries, it may still contain enough information to build a reference. A minimal GTF file only needs to contain exon features for protein-coding genes.

Set up the command:

cellranger mkgtf \
Danio_rerio.GRCz11.99.chr.gtf \
Danio_rerio.GRCz11.98.chr.filtered.gtf \
--attribute=gene_biotype:protein_coding

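This outputs the file Danio_rerio.GRCz11.98.chr.filtered.gtf, which is used in the next step. The output filename is arbitrary: it says 98 although the input GTF comes from Ensembl release 99, and the remaining commands in this tutorial use the name exactly as written here.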
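To get a feel for what an attribute filter keeps and removes, you can tally the gene_biotype values in a GTF before filtering. The sketch below does this on a toy GTF with made-up genes, and then applies a rough grep-based approximation of the protein_coding filter. It is only an illustration of the idea, not how cellranger mkgtf is implemented, and it is not a substitute for running mkgtf.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Toy GTF with three records (hypothetical genes g1, g2, g3).
printf '1\tensembl\texon\t1\t100\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n' > toy.gtf
printf '1\tensembl\texon\t50\t150\t.\t-\t.\tgene_id "g2"; gene_biotype "lincRNA";\n' >> toy.gtf
printf '2\tensembl\texon\t1\t100\t.\t+\t.\tgene_id "g3"; gene_biotype "protein_coding";\n' >> toy.gtf

# Tally records per gene_biotype value.
grep -o 'gene_biotype "[^"]*"' toy.gtf | sort | uniq -c

# Rough stand-in for the filter: keep comment lines and protein_coding records.
grep -E '^#|gene_biotype "protein_coding"' toy.gtf > toy.filtered.gtf
kept=$(wc -l < toy.filtered.gtf | tr -d ' ')
echo "records kept: $kept"
```

Running the tally on the real Ensembl GTF shows which biotypes are present and how many records the filter would drop.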

Set Up the Command for cellranger mkref

Now that you have the genome FASTA and filtered GTF files needed, set up the command to run the cellranger mkref pipeline.

The following is the command:

cellranger mkref \
--genome=Danio.rerio_genome \
--fasta=Danio_rerio.GRCz11.dna.primary_assembly.fa \
--genes=Danio_rerio.GRCz11.98.chr.filtered.gtf

Run cellranger mkref

Run the command. This can take several hours, depending on your system. If you are working on a shared computing environment such as an HPC cluster, submit this as a job to prevent competing with other users for resources.

The output looks similar to this:

cellranger mkref (3.1.0)
Copyright (c) 2019 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Creating new reference folder at Danio.rerio_genome
...done

Writing genome FASTA file into reference folder...
...done

Computing hash of genome FASTA file...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Computing hash of genes GTF file...
...done

Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done

Writing genome metadata JSON file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Apr 28 09:45:16 ..... Started STAR run
Apr 28 09:45:16 ... Starting to generate Genome files
Apr 28 09:45:52 ... starting to sort  Suffix Array. This may take a long time...
Apr 28 09:45:56 ... sorting Suffix Array chunks and saving them to disk...
Apr 28 10:05:40 ... loading chunks from disk, packing SA...
Apr 28 10:05:58 ... Finished generating suffix array
Apr 28 10:05:58 ... Generating Suffix Array index
Apr 28 10:07:48 ... Completed Suffix Array index
Apr 28 10:07:48 ..... Processing annotations GTF
Apr 28 10:07:57 ..... Inserting junctions into the genome indices
Apr 28 10:14:32 ... writing Genome to disk ...
Apr 28 10:14:35 ... writing Suffix Array to disk ...
Apr 28 10:14:44 ... writing SAindex to disk
Apr 28 10:14:48 ..... Finished successfully
...done.

Reference successfully created!

You can now specify this reference on the command line:
cellranger --transcriptome=Danio.rerio_genome ...

The final lines of the output above confirm that the reference was created. If you do not see the "Reference successfully created!" message, an error probably occurred. Copy the error message and email it to 10x Genomics support.
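When the command runs as a batch job, the terminal output is usually not watched live, so it can help to save it to a file and search that file for the success message. The sketch below assumes the output was redirected to a file named mkref.log (a hypothetical name, not something Cell Ranger creates for you) and uses a shortened stand-in log so that it runs on its own.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Stand-in for a saved log; in practice it would come from something like
#   cellranger mkref ... > mkref.log 2>&1   (mkref.log is a hypothetical name)
printf 'Generating STAR genome index...\n...done.\n\nReference successfully created!\n' > mkref.log

if grep -q 'Reference successfully created!' mkref.log; then
    status="ok"
else
    status="failed"
    # Show the end of the log, where an error message would most likely be.
    tail -n 20 mkref.log
fi
echo "mkref status: $status"
```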

Add a Marker Gene to the FASTA and GTF

There are cases where the publicly available GTF and FASTA files do not contain information for some of the genes expressed in a given sample. A transgenic sample is a good example of when you would not expect a gene of interest to be in the reference. In this example, the common marker gene Green Fluorescent Protein (GFP), used as an in vivo fluorescent reporter for gene expression, is added to the reference. This method of adding genes to a reference has also been reported to work for detecting genes from viral infections, provided the detected transcripts are polyadenylated.

Note: This is only one of many mRNA sequences available encoding for GFP. Make sure to use the sequence specific for your assay.

Next, get the GFP FASTA file from the European Nucleotide Archive:

wget -O GFP_orig.fa "https://www.ebi.ac.uk/ena/browser/api/fasta/AAA27722.1?download=true"

The header of this file looks like the following:

>ENA|AAA27722|AAA27722.1 Aequorea victoria green-fluorescent protein

There are special characters such as "|" and spaces in the header (all text after the >) of this FASTA sequence. These can be problematic for downstream applications. It helps to replace the header with a short, informative name that contains none of these characters. The following command uses the stream editor sed to search for a pattern (the original header) and replace it with new text (GFP), then redirects the output to a new file, GFP.fa.

sed 's/ENA|AAA27722|AAA27722\.1 Aequorea victoria green-fluorescent protein/GFP/' GFP_orig.fa > GFP.fa

Note: Another option is to open the GFP_orig.fa file with a text editor, such as nano, then manually edit the header and save the file as GFP.fa. Choose whichever method of changing the header you feel most comfortable with.
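A third option, sketched below, is to replace the whole first line without typing out the original header. This assumes the file holds a single sequence, so the header is line 1. The input file marker_orig.fa, its header, and its shortened sequence are made up for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Toy FASTA with a header containing "|" and spaces (made-up, shortened sequence).
printf '>ENA|X00000|X00000.1 example marker protein\nATGAGTAAAG\nGAGAAGAACT\n' > marker_orig.fa

# Replace the whole header line (line 1 only) with a short name.
sed '1s/^>.*/>GFP/' marker_orig.fa > marker.fa

header=$(head -n 1 marker.fa)
echo "$header"
```

Because the pattern matches any header, the same command works unchanged for a different marker sequence; only the replacement name needs to be edited.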

Now the FASTA file GFP.fa looks like the following:

>GFP
ATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGT
GATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGA
AAACTTACCCTTAAATTTATTTGCACTACTGGAAAGCTACCTGTTCCATGGCCAACACTT
GTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAG
CATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTAC
AAAGATGACGGGAACTACAAATCACGTGCTGAAGTCAAGTTTGAAGGTGATACCCTCGTT
AATAGAATTGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAA
ATGGAATACAACTATAACTCACACAATGTATACATCATGGCAGACAAACAAAAGAATGGA
ATCAAAGTTAACTTCAAAATTAGACACAACATTGAAGATGGAAGCGTTCAACTAGCAGAC
CATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTAC
CTGTCCACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTT
CTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAA

To find the number of bases in this sequence, use grep -v "^>" to select all lines that do not start with the > character, remove the line breaks with tr -d "\n" so they are not counted, and count the remaining characters with wc -c. The pipe character "|" sends the output of each command to the next.

cat GFP.fa | grep -v "^>" | tr -d "\n" | wc -c

The result shows there are 717 bases. This is important to know for the next step.
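The same count can be cross-checked with awk, which is also convenient when a FASTA holds more than one added sequence, because it reports a length per record. The following is a minimal sketch on a toy file with a made-up 23-base sequence; toy.fa is a hypothetical name.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>GFP\nATGAGTAAAG\nGAGAAGAACT\nTTT\n' > toy.fa

# Same pipeline as in the tutorial: drop headers, drop newlines, count characters.
n_bases=$(grep -v '^>' toy.fa | tr -d '\n' | wc -c | tr -d ' ')
echo "bases: $n_bases"

# Per-record lengths with awk: print "name length" for every sequence.
lengths=$(awk '/^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
               { len += length($0) }
               END { if (name != "") print name, len }' toy.fa)
echo "$lengths"
```

Both methods should agree. If they do not, the file may contain Windows-style line endings, which the tr-based count includes as extra characters.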

Now, make a custom GTF for GFP with the following command. The echo -e command prints the quoted text, and the -e option enables interpretation of backslash escapes such as \t. Use \t to insert the tabs that separate the 9 columns required by the GTF format. The end coordinate, 717, is the sequence length found in the previous step.

echo -e 'GFP\tunknown\texon\t1\t717\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";' > GFP.gtf

This is what the GFP.gtf file looks like with the cat GFP.gtf command:

GFP	unknown	exon	1	717	.	+	.	gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
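Since echo -e behaves differently across shells, printf is a more portable way to write the same record, and the end coordinate can be taken from the sequence length instead of being typed by hand. The sketch below shows this on a toy 23-base sequence. The file names toy.fa and toy.gtf and the variables name and end are illustrative choices, not part of the tutorial's required inputs.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>GFP\nATGAGTAAAG\nGAGAAGAACT\nTTT\n' > toy.fa   # toy sequence, 23 bases

name="GFP"
# End coordinate = number of bases in the sequence.
end=$(grep -v '^>' toy.fa | tr -d '\n' | wc -c | tr -d ' ')

# printf interprets \t itself, so this does not depend on echo -e support.
printf '%s\tunknown\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' \
    "$name" "$end" "$name" "$name" "$name" > toy.gtf

# A GTF record has exactly 9 tab-separated columns.
n_cols=$(awk -F '\t' '{ print NF }' toy.gtf)
echo "columns: $n_cols"
cat toy.gtf
```

A column count other than 9 usually means that spaces were typed where tabs were intended.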

Next, add the GFP.fa to the end of the D. rerio genome FASTA. But first, make a copy so that the original is unchanged.

cp Danio_rerio.GRCz11.dna.primary_assembly.fa Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa 

Then, append GFP.fa to the end of the Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa file. The >> operator appends to the target file. Note: Do not use >, which overwrites the target file.

cat GFP.fa >> Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

To confirm that the GFP entry was added to the FASTA file, use the grep ">" command to search for lines with the > character:

grep ">" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

The output looks similar to the following:

>1 dna:chromosome chromosome:GRCz11:1:1:59578282:1 REF
>10 dna:chromosome chromosome:GRCz11:10:1:45420867:1 REF
>11 dna:chromosome chromosome:GRCz11:11:1:45484837:1 REF
>12 dna:chromosome chromosome:GRCz11:12:1:49182954:1 REF
>13 dna:chromosome chromosome:GRCz11:13:1:52186027:1 REF
>14 dna:chromosome chromosome:GRCz11:14:1:52660232:1 REF
>15 dna:chromosome chromosome:GRCz11:15:1:48040578:1 REF
>16 dna:chromosome chromosome:GRCz11:16:1:55266484:1 REF
>17 dna:chromosome chromosome:GRCz11:17:1:53461100:1 REF
>18 dna:chromosome chromosome:GRCz11:18:1:51023478:1 REF
>19 dna:chromosome chromosome:GRCz11:19:1:48449771:1 REF
>2 dna:chromosome chromosome:GRCz11:2:1:59640629:1 REF
>20 dna:chromosome chromosome:GRCz11:20:1:55201332:1 REF
>21 dna:chromosome chromosome:GRCz11:21:1:45934066:1 REF
>22 dna:chromosome chromosome:GRCz11:22:1:39133080:1 REF
>23 dna:chromosome chromosome:GRCz11:23:1:46223584:1 REF
>24 dna:chromosome chromosome:GRCz11:24:1:42172926:1 REF
>25 dna:chromosome chromosome:GRCz11:25:1:37502051:1 REF
>3 dna:chromosome chromosome:GRCz11:3:1:62628489:1 REF
>4 dna:chromosome chromosome:GRCz11:4:1:78093715:1 REF
>5 dna:chromosome chromosome:GRCz11:5:1:72500376:1 REF
>6 dna:chromosome chromosome:GRCz11:6:1:60270059:1 REF
>7 dna:chromosome chromosome:GRCz11:7:1:74282399:1 REF
>8 dna:chromosome chromosome:GRCz11:8:1:54304671:1 REF
>9 dna:chromosome chromosome:GRCz11:9:1:56459846:1 REF
>MT dna:chromosome chromosome:GRCz11:MT:1:16596:1 REF
>KN149696.2 dna:scaffold scaffold:GRCz11:KN149696.2:1:368252:1 REF
>KN147651.2 dna:scaffold scaffold:GRCz11:KN147651.2:1:351968:1 REF
>KN149690.1 dna:scaffold scaffold:GRCz11:KN149690.1:1:343018:1 REF
>KN149686.1 dna:scaffold scaffold:GRCz11:KN149686.1:1:260365:1 REF
>KN147652.2 dna:scaffold scaffold:GRCz11:KN147652.2:1:252640:1 REF
>KN149688.2 dna:scaffold scaffold:GRCz11:KN149688.2:1:252035:1 REF
>KN149691.1 dna:scaffold scaffold:GRCz11:KN149691.1:1:233193:1 REF
...
>GFP

You can also count the number of contigs in the FASTA. There should now be 994 contigs including the extra GFP.

grep -c "^>" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa
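One pitfall when appending is a genome FASTA whose last line does not end with a newline: the new header would then be glued onto the final sequence line and would not be counted as a contig. The sketch below shows a guarded append on toy files and compares contig counts before and after. The file names genome.fa, marker.fa, and genome_marker.fa are hypothetical, and the toy genome deliberately lacks the trailing newline.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Toy genome whose last line deliberately lacks a trailing newline.
printf '>1\nACGT\n>2\nGGCC' > genome.fa
printf '>GFP\nATGA\n' > marker.fa

cp genome.fa genome_marker.fa
# If the last byte is not a newline, add one so ">GFP" starts on its own line.
if [ -n "$(tail -c 1 genome_marker.fa)" ]; then
    printf '\n' >> genome_marker.fa
fi
cat marker.fa >> genome_marker.fa

# The copy should contain exactly one more contig than the original.
n_before=$(grep -c '^>' genome.fa)
n_after=$(grep -c '^>' genome_marker.fa)
echo "contigs before: $n_before, after: $n_after"
```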

Use the cp command to make a copy of the filtered GTF with GFP added to the name. Then use the cat command to append the contents of GFP.gtf to the end of that copy.

cp Danio_rerio.GRCz11.98.chr.filtered.gtf Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf
cat GFP.gtf >> Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf

Check the file with the following command:

tail Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf

The output looks similar to the following, with the GFP entry as the last line of the file:

MT	RefSeq	start_codon	15308	15310	.	+	0	gene_id "ENSDARG00000063924"; gene_version "3"; transcript_id "ENSDART00000093625"; transcript_version "3"; exon_number "1"; gene_name "mt-cyb"; gene_source "RefSeq"; gene_biotype "protein_coding"; transcript_name "mt-cyb-201"; transcript_source "RefSeq"; transcript_biotype "protein_coding";
GFP	unknown	exon	1	717	.	+	.	gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";

Now use the Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf and Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa files as inputs to the cellranger mkref pipeline:

cellranger mkref --genome=Danio.rerio_genome_GFP --fasta=Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa --genes=Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf

This outputs a custom reference directory called Danio.rerio_genome_GFP.