Cell Ranger2.1, printed on 11/23/2024
Cell Ranger provides a pre-built human reference package for use with the pipeline. Our reference is based on the T cell receptor (TRA, TRB) and B cell immunoglobulin (IGH, IGL, IGK) gene annotations in Ensembl version 87. If you would like to use your own genome FASTA or gene GTF annotations, Cell Ranger supports the use of customer-generated Ensembl-based references. Cell Ranger also includes support for generating a V(D)J reference from the IMGT database.
There are two ways to generate a V(D)J reference:
The cellranger mkvdjref tool can be used to generate a custom reference package from a genome sequence FASTA file and a gene annotation GTF.
$ cellranger mkvdjref --genome=my_vdj_ref \
    --fasta=GRCh38_ensembl.fasta \
    --genes=GRCh38_ensembl.gtf
A Cell Ranger V(D)J reference consists of germline gene segment sequences. cellranger mkvdjref assumes that these sequences are contained within a genome reference FASTA, and that an Ensembl-formatted gene annotation GTF points to the relevant gene segments.
cellranger mkvdjref expects a FASTA file containing genomic reference sequences whose names are consistent with the names used in the GTF file.
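The naming requirement above can be checked before running mkvdjref. The following is a minimal sketch, not part of Cell Ranger: the function names are hypothetical, and it assumes that a FASTA record's name is the first whitespace-delimited word after `>` and that the GTF stores the chromosome name in its first tab-separated column.

```python
import io


def fasta_names(handle):
    """Return the set of sequence names (first word after '>') in a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_chromosomes(handle):
    """Return the set of values found in column 1 of a GTF stream."""
    chroms = set()
    for line in handle:
        # Skip blank lines and '#' header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        chroms.add(line.split("\t", 1)[0])
    return chroms


def missing_contigs(fasta_handle, gtf_handle):
    """Names used by the GTF that have no matching FASTA record, sorted."""
    return sorted(gtf_chromosomes(gtf_handle) - fasta_names(fasta_handle))


# Tiny in-memory example with made-up records; real files would be
# opened with open(path) instead of io.StringIO.
fasta = io.StringIO(">14 dna:chromosome\nACGT\n>7 dna:chromosome\nACGT\n")
gtf = io.StringIO(
    "#!genome-build GRCh38\n"
    "14\thavana\tCDS\t1\t4\t.\t+\t0\tgene_name \"TRAV1-1\";\n"
    "chr7\thavana\tCDS\t1\t4\t.\t+\t0\tgene_name \"TRBV2\";\n"
)
print(missing_contigs(fasta, gtf))  # prints ['chr7']
```

An empty list means every chromosome named in the GTF has a matching FASTA record. A typical mismatch is a `chr` prefix present in one file but not the other.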
Cell Ranger V(D)J expects a GTF file in an Ensembl-like format that contains information about V(D)J gene segments.
Column | Name | Description |
---|---|---|
1 | Chromosome | Must refer to a chromosome/contig in the genome FASTA. |
2 | Source | Unused. |
3 | Feature | Cell Ranger V(D)J only uses rows where this field is equal to CDS or five_prime_utr. |
4 | Start | Start position on the reference (1-based inclusive). |
5 | End | End position on the reference (1-based inclusive). |
6 | Score | Unused. |
7 | Strand | Strandedness of this feature on the reference: + or - . |
8 | Frame | Unused. |
9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value" . The attribute keys used by Cell Ranger V(D)J are detailed below. |
GTF Attribute | Description |
---|---|
transcript_id | Becomes the record_id in the Cell Ranger V(D)J reference entry format. |
transcript_biotype | The value is used to infer the V(D)J segment type. Either transcript_biotype or gene_biotype must be a value in the "Accepted Biotypes" list below. If transcript_biotype is not on the accepted list, then gene_biotype is used. |
gene_biotype | See transcript_biotype . |
gene_name | Must be specified. Becomes the gene_name in the Cell Ranger V(D)J reference entry format. |
Accepted Biotypes:
TR_C_gene
TR_D_gene
TR_J_gene
TR_V_gene
IG_C_gene
IG_D_gene
IG_J_gene
IG_V_gene
14 havana CDS 21621904 21621946 . + 0 transcript_id "ENST00000542354"; gene_name "TRAV1-1"; transcript_biotype "TR_V_gene";
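To make the rules in the tables above concrete, here is a minimal sketch of how one GTF row could be interpreted. It is not Cell Ranger's own code: the function names are hypothetical, and returning `None` for rows that lack an accepted biotype or a `gene_name` is an assumption of this sketch (the real tool may report an error instead). It also assumes that attribute values contain no semicolons.

```python
ACCEPTED_BIOTYPES = {
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
}
USED_FEATURES = {"CDS", "five_prime_utr"}


def parse_attributes(text):
    """Parse 'key "value"; key "value";' into a dict (no ';' inside values)."""
    attrs = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def vdj_record(line):
    """Return a dict for a usable V(D)J row, or None if this sketch skips it."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] not in USED_FEATURES:
        return None
    attrs = parse_attributes(cols[8])
    # transcript_biotype is tried first; gene_biotype is the fallback.
    biotype = attrs.get("transcript_biotype")
    if biotype not in ACCEPTED_BIOTYPES:
        biotype = attrs.get("gene_biotype")
    # Assumption: rows without an accepted biotype or a gene_name are skipped.
    if biotype not in ACCEPTED_BIOTYPES or "gene_name" not in attrs:
        return None
    return {
        "chromosome": cols[0],
        "feature": cols[2],
        "start": int(cols[3]),  # 1-based inclusive
        "end": int(cols[4]),    # 1-based inclusive
        "strand": cols[6],
        "record_id": attrs.get("transcript_id"),
        "gene_name": attrs["gene_name"],
        "biotype": biotype,
    }


# The example row from this page, rebuilt with tab separators.
example = "\t".join([
    "14", "havana", "CDS", "21621904", "21621946", ".", "+", "0",
    'transcript_id "ENST00000542354"; gene_name "TRAV1-1"; '
    'transcript_biotype "TR_V_gene";',
])
record = vdj_record(example)
print(record["gene_name"], record["biotype"], record["end"] - record["start"] + 1)
# prints: TRAV1-1 TR_V_gene 43
```

Because both coordinates are 1-based and inclusive, the feature length is `end - start + 1`.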
cellranger mkvdjref creates a directory whose name is specified by the --genome argument.
$ tree my_vdj_ref
my_vdj_ref
├── fasta
│   └── regions.fa
└── reference.json
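A quick way to confirm that a reference directory has the layout shown above is sketched below. The helper name is hypothetical and the check covers only the two files listed in the tree output. It does not inspect their contents.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the tree output above.
EXPECTED_FILES = ("fasta/regions.fa", "reference.json")


def missing_reference_files(ref_dir):
    """Return the expected files that are absent from ref_dir."""
    root = Path(ref_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Demonstration on a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "my_vdj_ref"
    (ref / "fasta").mkdir(parents=True)
    (ref / "fasta" / "regions.fa").write_text(">placeholder\nACGT\n")
    print(missing_reference_files(ref))  # prints ['reference.json']
    (ref / "reference.json").write_text("{}")
    print(missing_reference_files(ref))  # prints []
```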
The Cell Ranger V(D)J human reference package refdata-cellranger-vdj-GRCh38-alts-ensembl-2.0.0 was generated with the following steps.
Homo_sapiens.GRCh38.dna.toplevel.fa
Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf
vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf
vdj_GRCh38_alts_ensembl_10x_ignore_transcripts-2.0.0.txt
This reference was constructed by adding some entries to, and removing others from, the Ensembl GTF. Entries from multiple GTFs are added by specifying the --genes argument multiple times. Entries are removed by providing a list of transcript IDs to the --rm-transcripts argument. For details, please see cellranger mkvdjref --help.
$ wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz
$ gunzip Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz
$ cellranger mkvdjref --genome vdj_GRCh38_alts_ensembl \
    --fasta=Homo_sapiens.GRCh38.dna.toplevel.fa \
    --genes=Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf \
    --genes=vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf \
    --rm-transcripts=vdj_GRCh38_alts_ensembl_10x_ignore_transcripts-2.0.0.txt \
    --ref-version=2.0.0
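Conceptually, passing several --genes files together with --rm-transcripts amounts to pooling the GTF rows and dropping those whose transcript ID is on the ignore list. The sketch below illustrates that idea only. It is not how mkvdjref is implemented, the function name is hypothetical, and the one-ID-per-line layout of the ignore file is an assumption.

```python
import re

# Matches the transcript_id attribute in column 9 of a GTF row.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def merge_and_filter(gtf_sources, ignored_ids):
    """Yield rows from every GTF source, skipping ignored transcript IDs."""
    ignored = set(ignored_ids)
    for lines in gtf_sources:
        for line in lines:
            # Skip blank lines and '#' header/comment lines.
            if line.startswith("#") or not line.strip():
                continue
            match = TRANSCRIPT_ID.search(line)
            if match and match.group(1) in ignored:
                continue
            yield line


# Made-up rows standing in for an Ensembl GTF and a supplementary GTF.
ensembl = [
    '14\thavana\tCDS\t1\t9\t.\t+\t0\ttranscript_id "T_KEEP"; gene_name "A";\n',
    '14\thavana\tCDS\t20\t29\t.\t+\t0\ttranscript_id "T_DROP"; gene_name "B";\n',
]
extra = [
    '7\tcustom\tCDS\t5\t15\t.\t-\t0\ttranscript_id "T_ADDED"; gene_name "C";\n',
]
# Assumed ignore-file layout: one transcript ID per line.
ignore_file_text = "T_DROP\n"

kept = list(merge_and_filter([ensembl, extra], ignore_file_text.split()))
print(len(kept))  # prints 2
```

With real files, each source would be an open file handle rather than a list of strings.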
The cellranger mkvdjref tool can be used to generate a custom reference package from a FASTA file containing V(D)J segment sequences and associated metadata.
$ cellranger mkvdjref --genome=my_vdj_ref \
    --seqs=imgt_vdj.fasta
This is a FASTA file where the description line contains V(D)J-specific metadata.
>id|display_name record_id|gene_name|region_type|chain_type|chain|isotype|allele_name
SEQUENCE
Field | Description |
---|---|
id | Unique integer ID for this feature. |
display_name | This is used when displaying the segment in, e.g., Loupe V(D)J Browser. |
record_id | Describes the accession ID of the source molecule. Unused. |
gene_name | The name of the V(D)J gene, e.g. TRBV2-1. |
region_type | The only used values are L-REGION+V-REGION , D-REGION , J-REGION , and C-REGION . |
chain_type | Specifies whether this is a T cell or B cell receptor chain. The only used values are TR and IG. |
chain | The chain this segment belongs to, e.g. TRA or IGH. |
isotype | Specifies the class of heavy chain constant region; set to None if not applicable. |
allele_name | The identifier for the allele, e.g. 01 for TRBV2-1*01, or None if no allele is to be specified. |
>1|TRAV1*01 AF259072|TRAV1|L-REGION+V-REGION|TR|TRA|None|01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>979|IGHA*01 J00475|IGHA|C-REGION|IG|IGH|A|01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
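The description-line format above can be split into its fields as in the following sketch. The class and function names are hypothetical, not part of Cell Ranger. The sketch assumes a single space separates the `id|display_name` part from the remaining seven fields, and it maps the literal string None to Python's `None`.

```python
from typing import NamedTuple, Optional

USED_REGION_TYPES = {"L-REGION+V-REGION", "D-REGION", "J-REGION", "C-REGION"}
USED_CHAIN_TYPES = {"TR", "IG"}


class SegmentHeader(NamedTuple):
    feature_id: int
    display_name: str
    record_id: str
    gene_name: str
    region_type: str
    chain_type: str
    chain: str
    isotype: Optional[str]
    allele_name: Optional[str]


def none_if_literal(value):
    """Map the literal text 'None' to Python None."""
    return None if value == "None" else value


def parse_header(line):
    """Parse one '>id|display_name record_id|...|allele_name' line."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA description line")
    left, _, right = line[1:].strip().partition(" ")
    id_fields = left.split("|")
    meta = right.split("|")
    if len(id_fields) != 2 or len(meta) != 7:
        raise ValueError("unexpected number of fields: " + line.strip())
    feature_id, display_name = id_fields
    record_id, gene_name, region_type, chain_type, chain, isotype, allele = meta
    return SegmentHeader(
        int(feature_id), display_name, record_id, gene_name, region_type,
        chain_type, chain, none_if_literal(isotype), none_if_literal(allele),
    )


def is_used(header):
    """True if region_type and chain_type are among the documented used values."""
    return (header.region_type in USED_REGION_TYPES
            and header.chain_type in USED_CHAIN_TYPES)


h = parse_header(">979|IGHA*01 J00475|IGHA|C-REGION|IG|IGH|A|01")
print(h.gene_name, h.chain, h.isotype, h.allele_name, is_used(h))
# prints: IGHA IGH A 01 True
```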
Cell Ranger comes with a script called fetch-imgt, which downloads the relevant sequences from IMGT and generates a V(D)J segment FASTA file. This is then used to generate a V(D)J reference package.
This example generates a mouse V(D)J reference based on IMGT.
# Source the Cell Ranger 2.1.1 environment for your shell (bash/csh).
# (for bash shell)
source path/to/cellranger-2.1.1/sourceme.bash
# OR (for C shell)
source path/to/cellranger-2.1.1/sourceme.csh

# You might need to install the following Python packages.
pip install lxml
pip install biopython

# Using a script that comes with Cell Ranger, get data from IMGT and create a FASTA suitable for use by mkvdjref.
# The option --species is the name of the species for which the data is to be downloaded.
# The option --genome provides the prefix used to name the 2 output files. Only the file with suffix -mkvdjref-input.fasta is used as input to the mkvdjref utility.
path/to/cellranger-2.1.1/cellranger-cs/2.1.1/lib/bin/fetch-imgt --genome vdj_IMGT_mouse --species "Mus musculus"

# Build the Cell Ranger reference. You could also include Cell Ranger on your PATH to avoid specifying the full path for cellranger.
# The option --genome is a single identifier with no special symbols aside from hyphen or underscore. The reference will be placed in a directory created with that name.
# The option --seqs is the mkvdjref-input.fasta file generated by the fetch-imgt command.
path/to/cellranger-2.1.1/cellranger mkvdjref --genome=vdj_IMGT_mouse --seqs=vdj_IMGT_mouse-mkvdjref-input.fasta