Cell Ranger 7.1, printed on 11/23/2024
This page provides some details about the prebuilt human and mouse reference sequences (downloadable here). These references were created from V(D)J genes in Ensembl build 94, using both GTF and GFF3 files, which provide slightly different information. The process is mechanical, supplemented by manual edits, which we describe in detail here. There are two cases where we created unofficial gene names, and these should ultimately be replaced by official names.
One consequence of these changes is that all V gene sequences now begin with a start codon ATG. (This lies in the leader sequence coding for a signal peptide that is cleaved off.)
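The start-codon property can be spot-checked with a few lines of Python. This is only a minimal sketch: `read_fasta` and `missing_start_codon` are hypothetical helper names, the example headers are invented, and selecting only the V gene records is left to the caller because the header layout of the reference files is not described on this page.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def missing_start_codon(records):
    """Return the headers of records whose sequence does not begin with ATG."""
    return [header for header, seq in records if not seq.startswith("ATG")]


# Invented toy records, not taken from any real reference file.
example = """>V_GENE_A (hypothetical)
ATGGCCTGGACC
>V_GENE_B (hypothetical)
GCCTGGACCATG
"""

print(missing_start_codon(read_fasta(example)))
```

On a reference where every V gene begins with ATG, the returned list would be empty once non-V records are filtered out.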
For the prebuilt references, pseudogenes are excluded, except where we think the pseudogene labeling is incorrect, for specific instances described below.
Deleted TRBV6-3 because it is not in the Ensembl GTF and is nearly identical to TRBV6-2.
Deleted IGHV1/OR15-9 because it is labeled non-functional by NCBI.
Allowed TRAJ8, TRAV35 and TRBV21-1 even though they are labeled pseudogenes, because they are observed in productive pairs.
Added TRAJ15, TRBD2, TRBV11-2 and TRGV11 because they are observed in productive pairs.
Added IGHV1-8, IGKV2-18 and IGLV6-57 because they are observed in productive pairs.
Added 1 base to the right of TRAJ36 because otherwise annotations fail the in-frame requirement for a productive contig, which all other observed human and mouse J genes satisfy.
Trimmed 3 bases from the right of TRAJ37 because otherwise one finds a three-base indel when one annotates the J/C junction (in observed data).
Trimmed 89 bases from the left of IGLJ1, because these bases are not part of the actual J segment.
Trimmed 104 bases from the left of IGLJ2, because these bases are not part of the actual J segment.
Trimmed 113 bases from the left of IGLJ3, because these bases are not part of the actual J segment.
Trimmed 57 bases from the left of TRBV20/OR9-2, because these bases are not part of the actual V segment.
Added an alternate form of TRBV20-1, differing from the reference by a 3-base insertion, because we observe this form (here and below, we use the same name for the alternate inserted form).
Added an alternate form of TRBV7-7, differing from the reference by a 15-base insertion, because we observe this form.
Add an allele of the gene IGHJ6.
Remove the first base of the C region in certain cases. In these cases we observe that in most transcripts, the J region and C region overlap by exactly one base.
Replace IGKV2D-40, whose leader sequence appears truncated.
Delete IGKV2-18, although we had previously added it. It is probably a pseudogene.
Delete IGLV5-48. It is truncated on the right.
Delete TRBV21-1, which has multiple frameshifts.
Add IGHV4-30-4, which was missing.
Add IGKV1-NL1, which was missing.
Add IGHV4-38-2, which was missing.
Deleted IGHV1-67, because it is labeled a pseudogene by NCBI.
Added IGHV12-1, because we observe this gene in data.
Added a V gene observed in two BALB/c datasets, aligning to an unplaced sequence in the BALB/c whole genome assembly, and we unofficially named this gene IGHV1-unknown1.
Added a form of TRAV4-4-DV10 seen in BALB/c.
Added a form of TRAV13-1 or TRAV13D-1 seen in BALB/c, and arbitrarily labeled TRAV13-1.
Added a very common alternate splicing between the first exon of TRBV12-2 and the second exon of TRBV13-2, which we unofficially named TRBV12-2+TRBV13-2.
Added an alternate form of TRAV16N, differing from the reference form by having a 3-base insertion, because we observe this in data.
Added an alternate form of TRAV6N-5, differing from the reference by a 3-base insertion, because we observe this in data.
Added an alternate form of TRAV13N-4, differing from the reference by a 15-base insertion, because we observe this in data.
Added an alternate form of TRBV13-2, differing from the reference by a 21-base insertion, because we observe this in data.
Remove the first base of the C region in certain cases. In these cases we observe that in most transcripts, the J region and C region overlap by exactly one base.
Delete TRAV23, which is frameshifted.
Delete the first base of the constant region gene IGHG2B.
Make a six-base insertion in IGKV12-89, based on empirical data.
Correct IGHV8-9, whose reference amino acid sequence had S in place of the canonical C at the end of FWR3. The correction is consistent with observations from 10x data.
Add a missing allele of IGKV2-109.
Add missing gene IGKV4-56.
Add missing gene IGHV1-2.
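Most of the manual edits listed above are simple string operations on a reference sequence: trimming bases from the left or right, inserting a few bases, or dropping the first base of a C region. The sketch below illustrates them in Python. The function names and toy sequences are assumptions made for illustration; this is not the code used to build the references, and the overlap check is a simplified stand-in for inspecting real transcripts.

```python
def trim_left(seq, n):
    """Drop n bases from the left (5') end of a sequence."""
    return seq[n:]


def trim_right(seq, n):
    """Drop n bases from the right (3') end of a sequence."""
    return seq[:len(seq) - n]


def insert_at(seq, pos, bases):
    """Insert bases before 0-based position pos."""
    return seq[:pos] + bases + seq[pos:]


def jc_overlap_by_one(transcript, j_seq, c_seq):
    """Return True if the transcript joins J to C with a one-base overlap.

    That is, J followed by C minus its first base occurs in the transcript,
    while J followed by the full annotated C does not.
    """
    return (j_seq + c_seq[1:]) in transcript and (j_seq + c_seq) not in transcript


# Invented toy sequences, not real gene segments.
j = "GTCTCCTCAG"
c = "GCCTCCACC"
transcript = "AAA" + j + c[1:] + "TTT"

if jc_overlap_by_one(transcript, j, c):
    # Mirror the edit described above: remove the first base of the C region.
    c = trim_left(c, 1)

print(c)
```

Computing the right trim as `len(seq) - n` rather than slicing with `-n` keeps `trim_right(seq, 0)` from returning an empty string.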