10x Genomics
Chromium Single Cell Multiome ATAC + Gene Exp.

Creating a Custom Reference

A tutorial on using Cell Ranger ARC mkref to create a reference for Rattus norvegicus

Learning objectives

In this tutorial, you will:

Prerequisites

Brief introduction to Cell Ranger ARC mkref

Cell Ranger ARC is a set of analysis pipelines that process Chromium Single Cell Multiome ATAC + Gene Expression sequencing data. Some Cell Ranger ARC pipelines (e.g. cellranger-arc count) require a reference transcriptome as input. In addition to providing pre-built references for human and mouse transcriptomes, Cell Ranger ARC also provides a pipeline called cellranger-arc mkref that enables users to create custom references using a reference genome and its corresponding genome annotation file (GTF) as inputs.

Building the Rattus norvegicus reference transcriptome

Rattus norvegicus (commonly known as rat) is a popular model organism with a well-sequenced genome and transcriptome. This tutorial walks through downloading the rat genome FASTA (assembly mRatBN7.2) and the matching GTF file (Ensembl release 105) from Ensembl, then creating a custom rat reference transcriptome compatible with cellranger-arc.

Start by opening up a terminal window. You may log in to a remote server or choose to create the reference on your local machine. Refer to the System Requirements page for details.

Download the reference FASTA file

In your working directory, download the input rat genome in FASTA format from Ensembl using the wget command, then uncompress the file. Please note that we have chosen a "toplevel" or primary assemblies FASTA because it contains primary contigs and no non-chromosomal or haplotype contigs.

Since this is a large file, the command may take several minutes to complete:

# Download command:
wget http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz
# Uncompress command:
gunzip -v Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz

Use ls to list out all your files. A file named Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa should appear in the working directory.
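As an optional sanity check, a short shell sketch like the one below counts the sequence records in the uncompressed FASTA and prints the first few headers, which shows how the contigs are named. The function name count_fasta_records is made up for this illustration and is not part of Cell Ranger ARC.

```shell
# Hypothetical helper (not part of Cell Ranger ARC): count FASTA records,
# i.e. lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

fasta_file="Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa"
if [ -f "$fasta_file" ]; then
    echo "Sequence records: $(count_fasta_records "$fasta_file")"
    # Print the first five headers to see how contigs are named.
    grep -m 5 '^>' "$fasta_file"
else
    echo "FASTA not found: $fasta_file" >&2
fi
```

A count of zero, or a missing file, suggests the download or the gunzip step did not complete.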

Download the gene annotation file (GTF)

Similarly, download the gene annotations file (GTF) corresponding to the FASTA file in the working directory and uncompress it:

# Download command:
wget http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.105.gtf.gz
# Uncompress command:
gunzip -v Rattus_norvegicus.mRatBN7.2.105.gtf.gz

Use ls to check that a file called Rattus_norvegicus.mRatBN7.2.105.gtf has appeared in the working directory.
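Because the GTF is meant to correspond to the FASTA, it can be worth confirming that the contig names in the first column of the GTF also appear as FASTA headers. The sketch below is one way to do that with standard tools; the function name contigs_missing_from_fasta is hypothetical, and the check is a convenience, not a step that Cell Ranger ARC requires.

```shell
# Hypothetical helper: print contig names used in the GTF (column 1)
# that have no matching header in the FASTA.
# Usage: contigs_missing_from_fasta <gtf> <fasta>
contigs_missing_from_fasta() {
    fa_names=$(mktemp)
    gtf_names=$(mktemp)
    # FASTA contig name = text after ">" up to the first whitespace.
    grep '^>' "$2" | sed 's/^>//' | awk '{print $1}' | sort -u > "$fa_names"
    # GTF contig name = first tab-separated column of non-comment lines.
    grep -v '^#' "$1" | cut -f 1 | sort -u > "$gtf_names"
    # comm -23 keeps only the lines unique to the first file (GTF-only names).
    comm -23 "$gtf_names" "$fa_names"
    rm -f "$fa_names" "$gtf_names"
}

gtf_file="Rattus_norvegicus.mRatBN7.2.105.gtf"
fasta_file="Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa"
if [ -f "$gtf_file" ] && [ -f "$fasta_file" ]; then
    contigs_missing_from_fasta "$gtf_file" "$fasta_file"
fi
```

No output means every contig named in the GTF was found in the FASTA.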

Filter the GTF

GTF filtering is an optional step that can improve mapping and UMI retention. Filtering removes low confidence transcripts and genes, restricts the number of gene classes assigned to a sequence, and removes pseudo-autosomal genes. Filtering parameters used in this tutorial are functionally equivalent to those used by 10x Genomics to create the human and mouse references.

Run this cellranger-arc mkgtf command to restrict the rat GTF to protein-coding, lncRNA, antisense, and immune-related genes:

cellranger-arc mkgtf Rattus_norvegicus.mRatBN7.2.105.gtf filtered_Rattus_norvegicus.mRatBN7.2.105.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene

A successful cellranger-arc mkgtf run ends with this message:

Writing new genes GTF file (may take 10 minutes for a 1GB input GTF file)...
...done

Use the ls command to list files present in the directory. You should see a new file named filtered_Rattus_norvegicus.mRatBN7.2.105.gtf.
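To see what the filter did, you can tally the gene_biotype values in the original and filtered GTF files. This is a minimal sketch that assumes an Ensembl-style GTF in which gene records carry a gene_biotype "..." attribute in column 9; the function name count_gene_biotypes is hypothetical.

```shell
# Hypothetical helper: tally gene_biotype values over "gene" records in a GTF.
count_gene_biotypes() {
    awk -F '\t' '$3 == "gene" {
        if (match($9, /gene_biotype "[^"]+"/)) {
            # Skip the 14 characters of: gene_biotype "  and drop the closing quote.
            print substr($9, RSTART + 14, RLENGTH - 15)
        }
    }' "$1" | sort | uniq -c | sort -k1,1nr
}

for gtf_file in Rattus_norvegicus.mRatBN7.2.105.gtf \
                filtered_Rattus_norvegicus.mRatBN7.2.105.gtf; do
    if [ -f "$gtf_file" ]; then
        echo "== $gtf_file"
        count_gene_biotypes "$gtf_file"
    fi
done
```

The filtered file should list only biotypes that were passed to mkgtf with --attribute.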

Create a config file

Next, set up your configuration (or config) file. The config file provides cellranger-arc mkref with all the relevant sample and run information. Copy and paste this text into a TXT file using your text editor of choice (e.g. nano). Name the file mRatBN7.config and save it.

{
    organism: "Rattus_norvegicus"
    genome: ["mRatBN7"]
    input_fasta: ["Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa"]
    input_gtf: ["filtered_Rattus_norvegicus.mRatBN7.2.105.gtf"]
}

The genome, input_fasta, and input_gtf parameters are required and described here:

Parameter      Description
genome         The specific version name of the organism's genome. This tutorial uses Rattus norvegicus genome version mRatBN7.
input_fasta    Path to the genome assembly FASTA file: Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa
input_gtf      Path to the gene annotation file corresponding to the input FASTA file. Use the filtered GTF as the input GTF: filtered_Rattus_norvegicus.mRatBN7.2.105.gtf

To view a comprehensive list of all cellranger-arc mkref parameters that can be input into the config file, refer to Step 4 of the Cell Ranger ARC mkref pipeline page.
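If you prefer not to open a text editor, the same config file can be written directly from the shell with a here-document. This is only an alternative way to create the file; its contents are identical to the text given earlier in this section.

```shell
# Write the config file from the shell. Quoting 'EOF' disables variable
# expansion, so the text is written exactly as typed.
cat > mRatBN7.config <<'EOF'
{
    organism: "Rattus_norvegicus"
    genome: ["mRatBN7"]
    input_fasta: ["Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa"]
    input_gtf: ["filtered_Rattus_norvegicus.mRatBN7.2.105.gtf"]
}
EOF

# Print the file to confirm its contents.
cat mRatBN7.config
```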

Run the command to build the custom reference

Finally, run mkref with this command:

cellranger-arc mkref --config=mRatBN7.config

The --config argument supplies the configuration file to cellranger-arc mkref. A full list of mkref flags and system requirements can be found on the Custom References page.
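Since the build is long, it may help to confirm beforehand that the files named in the config actually exist. The sketch below assumes a config laid out like the one above, with a single quoted path per key, and that relative paths are resolved from the current directory. The helper names config_path and check_mkref_inputs are hypothetical and are not part of Cell Ranger ARC.

```shell
# Hypothetical helper: extract the first quoted path listed for a key such as
# input_fasta or input_gtf.
# Usage: config_path <key> <config-file>
config_path() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*\[\"\([^\"]*\)\".*/\1/p" "$2"
}

# Hypothetical helper: return non-zero if either input file is missing.
# Usage: check_mkref_inputs <config-file>
check_mkref_inputs() {
    status=0
    for key in input_fasta input_gtf; do
        path=$(config_path "$key" "$1")
        if [ -z "$path" ] || [ ! -f "$path" ]; then
            echo "Missing $key: ${path:-<not set>}" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: only start the build when both inputs are present.
# check_mkref_inputs mRatBN7.config && cellranger-arc mkref --config=mRatBN7.config
```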

Execution begins with a message similar to:

user_prompt$ cellranger-arc mkref --config=mRatBN7.config
>>> Creating reference for mRatBN7 <<<

Creating new reference folder at /working-directory/mRatBN7
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done
...

The run can take more than an hour. After more output on the command prompt, it should end with a success message similar to:

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Jun 29 12:53:12 ..... started STAR run
Jun 29 12:53:12 ... starting to generate Genome files
Jun 29 12:55:02 ... starting to sort Suffix Array. This may take a long time...
Jun 29 12:55:20 ... sorting Suffix Array chunks and saving them to disk...
Jun 29 13:42:03 ... loading chunks from disk, packing SA...
Jun 29 13:42:43 ... finished generating suffix array
Jun 29 13:42:43 ... generating Suffix Array index
Jun 29 13:48:12 ... completed Suffix Array index
Jun 29 13:48:12 ..... processing annotations GTF
Jun 29 13:48:26 ..... inserting junctions into the genome indices
Jun 29 13:59:55 ... writing Genome to disk ...
Jun 29 14:00:00 ... writing Suffix Array to disk ...
Jun 29 14:00:15 ... writing SAindex to disk
Jun 29 14:00:19 ..... finished successfully
...done.

Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done

Computing hash of genes GTF file...
...done

...done

Generating bwa index (may take over an hour for a 3Gb genome)...
[bwa_index] Pack FASTA... 42.42 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=5295831456, availableWord=384634008
[BWTIncConstructFromPacked] 10 iterations done. 100000000 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 200000000 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 300000000 characters processed.
...
....
[BWTIncConstructFromPacked] 590 iterations done. 5280752896 characters processed.
[bwt_gen] Finished constructing BWT in 598 iterations.
[bwa_index] 2612.22 seconds elapse.
[bwa_index] Update BWT... 23.61 sec
[bwa_index] Pack forward-only FASTA... 26.37 sec
[bwa_index] Construct SA from BWT and Occ... 1226.05 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /working-directory/mRatBN7/fasta/genome.fa
[main] Real time: 3940.825 sec; CPU: 3929.699 sec
done

Writing TSS and transcripts bed file...
...done

Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done

Computing hash of genes GTF file...
...done

...done

>>> Reference successfully created at mRatBN7 <<<

Using the tree -L 2 command inside the newly created mRatBN7 folder, you can examine the output files:

mRatBN7
  ├── fasta
  │   ├── genome.fa
  │   ├── genome.fa.amb
  │   ├── genome.fa.ann
  │   ├── genome.fa.bwt
  │   ├── genome.fa.fai
  │   ├── genome.fa.pac
  │   └── genome.fa.sa
  ├── genes
  │   └── genes.gtf.gz
  ├── reference.json
  ├── regions
  │   ├── transcripts.bed
  │   └── tss.bed
  └── star
      ├── chrLength.txt
      ├── chrNameLength.txt
      ├── chrName.txt
      ├── chrStart.txt
      ├── exonGeTrInfo.tab
      ├── exonInfo.tab
      ├── geneInfo.tab
      ├── Genome
      ├── genomeParameters.txt
      ├── SA
      ├── SAindex
      ├── sjdbInfo.txt
      ├── sjdbList.fromGTF.out.tab
      ├── sjdbList.out.tab
      └── transcriptInfo.tab

The path to this reference transcriptome folder is required to run cellranger-arc pipelines.
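Before using the new reference, a quick check that the folder contains the files shown in the tree above can catch an interrupted build. The sketch below tests for a subset of those files; the function name check_reference_dir is hypothetical.

```shell
# Hypothetical helper: confirm that a reference folder contains a few of the
# files listed in the tree above. Prints each missing file to stderr and
# returns non-zero if any are absent.
# Usage: check_reference_dir <reference-folder>
check_reference_dir() {
    status=0
    for f in reference.json fasta/genome.fa fasta/genome.fa.fai \
             genes/genes.gtf.gz regions/tss.bed regions/transcripts.bed \
             star/SA star/Genome; do
        if [ ! -e "$1/$f" ]; then
            echo "Missing: $1/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: check the folder, then print its absolute path.
# check_reference_dir mRatBN7 && (cd mRatBN7 && pwd)
```

The printed path is the value to supply when a pipeline asks for the reference, for example through the --reference argument of cellranger-arc count.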

Next steps