Cell Ranger 2.0, printed on 09/20/2024
Cell Ranger provides a pre-built human reference package for use with the pipeline. If you would like to use your own genome FASTA or gene GTF annotations, Cell Ranger also supports user-generated references.
The cellranger mkvdjref tool can be used to generate a custom reference package.
$ cellranger mkvdjref --genome=my_vdj_ref \
    --fasta=GRCh38_ensembl.fasta \
    --genes=GRCh38_ensembl.gtf
A Cell Ranger V(D)J reference consists of germline gene segment sequences. The tool assumes that these sequences are contained within a genome reference FASTA and that a gene annotation GTF points to the relevant gene segments. Currently, it assumes the GTF is in an Ensembl-like format. If you are using a transcriptome- or segment-based V(D)J reference rather than a genome-based reference, you can treat each transcript as a "chromosome" and construct a GTF that annotates the transcripts accordingly.
cellranger mkvdjref expects a FASTA file containing genomic reference sequences whose names are consistent with the names used in the GTF file.
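Before running the tool, it can help to confirm that every sequence name in the GTF also appears in the FASTA. The following is a minimal sketch of such a check, not part of Cell Ranger. It assumes that a FASTA sequence name is the first whitespace-delimited token after `>`; the helper names and the toy data are hypothetical.

```python
import io

def fasta_names(handle):
    """Collect sequence names from FASTA header lines.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def gtf_seqnames(handle):
    """Collect the values of GTF column 1, skipping comments and blank lines."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names

def missing_from_fasta(fasta_handle, gtf_handle):
    """Return GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_seqnames(gtf_handle) - fasta_names(fasta_handle))

# Toy in-memory data: the second GTF row uses 'chr7', but the FASTA uses '7'.
fasta = io.StringIO(">14 dna:chromosome\nACGT\n>7 dna:chromosome\nACGT\n")
gtf = io.StringIO(
    "#!genome-build GRCh38\n"
    '14\thavana\tCDS\t1\t4\t.\t+\t0\tgene_name "X";\n'
    'chr7\thavana\tCDS\t1\t4\t.\t+\t0\tgene_name "Y";\n'
)
missing = missing_from_fasta(fasta, gtf)
print(missing)
```

For real files, the same functions can be called with handles returned by `open(...)` instead of `io.StringIO`. A non-empty result usually points to a naming mismatch such as `chr7` versus `7`.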
Cell Ranger V(D)J expects a GTF file in an Ensembl-like format that contains information about V(D)J gene segments.
GTF Column | Name | Description |
---|---|---|
1 | Chromosome | Must refer to a chromosome/contig in the genome fasta. |
2 | Source | Unused. |
3 | Feature | Cell Ranger only uses rows where this column is equal to either CDS or five_prime_utr . |
4 | Start | Start position on the reference (1-based inclusive). |
5 | End | End position on the reference (1-based inclusive). |
6 | Score | Unused. |
7 | Strand | Strandedness of this feature on the reference: + or - . |
8 | Frame | Unused. |
9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value" . The attribute keys used by Cell Ranger V(D)J are detailed below. |
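To make the column semantics concrete, here is a small sketch that keeps only CDS and five_prime_utr rows and converts the 1-based inclusive Start/End coordinates into a Python slice. It is an illustration under stated assumptions, not Cell Ranger's implementation: the function names and the toy sequence are hypothetical, and reverse-complementing features on the `-` strand follows the usual GTF convention rather than anything stated above.

```python
USED_FEATURES = {"CDS", "five_prime_utr"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def parse_row(line):
    """Split one tab-delimited GTF row into the columns used here."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 GTF columns, got {len(cols)}")
    return {
        "chrom": cols[0],
        "feature": cols[2],
        "start": int(cols[3]),  # 1-based, inclusive
        "end": int(cols[4]),    # 1-based, inclusive
        "strand": cols[6],
        "attributes": cols[8],
    }

def feature_sequence(row, chrom_seq):
    """Extract the feature sequence from its chromosome/contig sequence."""
    # 1-based inclusive [start, end] maps to the 0-based slice [start-1:end].
    seq = chrom_seq[row["start"] - 1 : row["end"]]
    if row["strand"] == "-":
        # Assumption: minus-strand features are read as the reverse complement.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq

# Toy contig and rows (hypothetical data).
chrom = "AACCGGTTAC"
lines = [
    'toy\t.\tCDS\t2\t5\t.\t+\t0\ttranscript_id "T1";',
    'toy\t.\tCDS\t2\t5\t.\t-\t0\ttranscript_id "T2";',
    'toy\t.\texon\t1\t10\t.\t+\t.\ttranscript_id "T1";',
]
rows = [parse_row(line) for line in lines]
used = [row for row in rows if row["feature"] in USED_FEATURES]
seqs = [feature_sequence(row, chrom) for row in used]
print(seqs)
```

The `exon` row is dropped by the feature filter, and the two remaining rows cover the same four bases on opposite strands.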
GTF Attribute | Description |
---|---|
transcript_id | Becomes the record_id in the Cell Ranger V(D)J reference entry format. |
transcript_biotype | The value is used to infer the V(D)J segment type. Either transcript_biotype or gene_biotype must be a value in the "Accepted Biotypes" list below. If transcript_biotype is not on the accepted list, then gene_biotype is used. |
gene_biotype | See transcript_biotype . |
gene_name | Must be specified. Becomes the gene_name in the Cell Ranger V(D)J reference entry format. |
Accepted Biotypes:
TR_C_gene
TR_D_gene
TR_J_gene
TR_V_gene
IG_C_gene
IG_D_gene
IG_J_gene
IG_V_gene
14 havana CDS 21621904 21621946 . + 0 transcript_id "ENST00000542354"; gene_name "TRAV1-1"; transcript_biotype "TR_V_gene";
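The attribute rules above can be sketched in a few lines of Python. This is a hedged illustration of the documented fallback (use transcript_biotype if it is on the accepted list, otherwise gene_biotype), not the tool's own parser. The function names and the regular expression are assumptions, and the example row is written with tabs between columns, as GTF requires.

```python
import re

ACCEPTED_BIOTYPES = {
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
}

# Matches pairs of the form: key "value"
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')

def parse_attributes(field):
    """Parse GTF column 9 into a dict of attribute key -> value."""
    return dict(_ATTR_RE.findall(field))

def segment_biotype(attrs):
    """Return the accepted biotype for a row, or None if neither key qualifies.

    transcript_biotype is checked first; gene_biotype is the fallback.
    """
    for key in ("transcript_biotype", "gene_biotype"):
        value = attrs.get(key)
        if value in ACCEPTED_BIOTYPES:
            return value
    return None

line = (
    "14\thavana\tCDS\t21621904\t21621946\t.\t+\t0\t"
    'transcript_id "ENST00000542354"; gene_name "TRAV1-1"; '
    'transcript_biotype "TR_V_gene";'
)
attrs = parse_attributes(line.split("\t")[8])
biotype = segment_biotype(attrs)
print(attrs["transcript_id"], attrs["gene_name"], biotype)
```

A row whose transcript_biotype is not on the accepted list but whose gene_biotype is, for example `{"transcript_biotype": "protein_coding", "gene_biotype": "IG_V_gene"}`, resolves to the gene_biotype value.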
cellranger mkvdjref creates a directory whose name is specified by the --genome argument.
$ tree my_vdj_ref
my_vdj_ref
├── fasta
│   └── regions.fa
└── reference.json
The Cell Ranger V(D)J human reference package refdata-cellranger-vdj-GRCh38-alts-ensembl-2.0.0 was generated with the following steps.
Homo_sapiens.GRCh38.dna.toplevel.fa
Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf
vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf
vdj_GRCh38_alts_ensembl_10x_ignore_transcripts-2.0.0.txt
This reference was constructed by adding some entries to, and removing some entries from, the Ensembl GTF. Entries from multiple GTFs are added by specifying the --genes argument multiple times. Entries are removed by providing a list of transcript IDs to the --rm-transcripts argument. For details, see cellranger mkvdjref --help.
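When the build is scripted, the repeated --genes argument is easy to generate from a list. The sketch below assembles the argument vector using only the flags named in this document. The function name `mkvdjref_argv` is hypothetical, and the code builds and prints the command without running it.

```python
import shlex

def mkvdjref_argv(genome, fasta, genes, rm_transcripts=None, ref_version=None):
    """Build an argument list for cellranger mkvdjref.

    `genes` is a list of GTF paths; each one becomes its own --genes argument.
    """
    argv = ["cellranger", "mkvdjref", f"--genome={genome}", f"--fasta={fasta}"]
    argv += [f"--genes={path}" for path in genes]
    if rm_transcripts:
        argv.append(f"--rm-transcripts={rm_transcripts}")
    if ref_version:
        argv.append(f"--ref-version={ref_version}")
    return argv

argv = mkvdjref_argv(
    genome="vdj_GRCh38_alts_ensembl",
    fasta="Homo_sapiens.GRCh38.dna.toplevel.fa",
    genes=[
        "Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf",
        "vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf",
    ],
    rm_transcripts="vdj_GRCh38_alts_ensembl_10x_ignore_transcripts-2.0.0.txt",
    ref_version="2.0.0",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

To execute the command, the list could be passed to `subprocess.run(argv, check=True)` on a machine where cellranger is on the PATH.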
$ wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz
$ gunzip Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz
$ cellranger mkvdjref --genome vdj_GRCh38_alts_ensembl \
    --fasta=Homo_sapiens.GRCh38.dna.toplevel.fa \
    --genes=Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf \
    --genes=vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf \
    --rm-transcripts=vdj_GRCh38_alts_ensembl_10x_ignore_transcripts-2.0.0.txt \
    --ref-version=2.0.0