Probe Set Reference CSV and Supporting Files

The following probe set reference and supporting files are available on the Cell Ranger downloads page and below:

File	Description
Probe set reference CSV file	This CSV file is a required input for Cell Ranger to enable analysis of Single Cell Fixed RNA Profiling data. It specifies the probe sequences used for probe alignment.
Probe off-target activity CSV	A CSV file that lists probes with predicted off-target activity. These probes are excluded from analysis by default.
Probe set BED file	A BED12 file that contains the sequences and genomic coordinates of the probes. This file can be used to visualize the probe locations in a genome browser and intersect probe locations with other data sources.
Probe set metadata TSV file	A TSV file that lists probes with additional information about gene name and description.
Probe Barcode sequence TXT file	A text file with the sequences of the Probe Barcodes. This file is available in the Cell Ranger software tarball, on the Cell Ranger downloads page, and below.

Download probe set reference and supporting files

Cell Ranger v7.1 is compatible with both the v1.0 and v1.0.1 probe set reference CSV files (v1.0.1 adds a new region column). Cell Ranger v7.0 and v7.0.1 are compatible with the v1.0 probe set reference CSV files.

Probe identifiers

Files containing information about individual probes have a column corresponding to the probe identifier (ID) that uniquely identifies each probe. Probe IDs take the following format:


gene_id|gene_name|probe_sequence_hash

For example, the probe for the gene TSPAN6 in the human whole transcriptome probe set, which has the Ensembl gene ID ENSG00000000003 in the GRCh38-2020-A reference, has the probe ID ENSG00000000003|TSPAN6|41ef80c.

A small number of probes whose ID includes the prefix DEPRECATED are excluded from analysis by default.
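The ID format above can be split into its three parts with a few lines of Python. This is a minimal sketch, not part of Cell Ranger: the `ProbeId` type and `parse_probe_id` function are hypothetical names, and the check for deprecated probes assumes only that the ID starts with the literal text DEPRECATED, since the exact layout of the prefix is not specified here.

```python
from typing import NamedTuple


class ProbeId(NamedTuple):
    gene_id: str
    gene_name: str
    sequence_hash: str
    deprecated: bool


def parse_probe_id(probe_id: str) -> ProbeId:
    """Split a probe ID of the form gene_id|gene_name|probe_sequence_hash."""
    # Assumption: a deprecated probe is recognized by its ID starting with
    # "DEPRECATED". The prefix is only detected here, not stripped.
    deprecated = probe_id.startswith("DEPRECATED")
    parts = probe_id.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields in {probe_id!r}")
    gene_id, gene_name, sequence_hash = parts
    return ProbeId(gene_id, gene_name, sequence_hash, deprecated)


example = parse_probe_id("ENSG00000000003|TSPAN6|41ef80c")
print(example.gene_id, example.gene_name, example.sequence_hash, example.deprecated)
```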

File formats for probe set downloads

Probe set reference CSV file

This CSV file is a required input for Cell Ranger to enable analysis of Fixed RNA Profiling data. It specifies the sequences used as a reference for probe alignment and the gene ID associated with each probe. See a description of the probe-set argument in the cellranger multi config CSV documentation. This file is provided in the Cell Ranger tarball.

The following snippet is an example from a v1.0 probe set reference CSV file:


#probe_set_file_format=1.0
#panel_name=Chromium Human Transcriptome Probe Set
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included
ENSG00000000003,GGTGA[...]ATGGC,ENSG00000000003|TSPAN6|8eab823,TRUE
ENSG00000000003,TCTGC[...]TTAGG,ENSG00000000003|TSPAN6|9d7fe51,TRUE
[ ... ]

The v1.0.1 probe set reference CSV has a new region column, which indicates whether a probe spans a splice junction by at least 10 bp (spliced) or not (unspliced). The following snippet is an example from a v1.0.1 probe set reference CSV file:


#probe_set_file_format=2.0
#panel_name=Chromium Human Transcriptome Probe Set v1.0.1
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included,region
ENSG00000000003,GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC,ENSG00000000003|TSPAN6|8eab823,TRUE,spliced
ENSG00000000003,TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG,ENSG00000000003|TSPAN6|9d7fe51,TRUE,unspliced
[ ... ]

The columns of this file are:

Column Name	Description
`gene_id`	The Ensembl gene identifier targeted by the probe.
`probe_seq`	The nucleotide sequence of the probe, which is complementary to the transcript sequence.
`probe_id`	The probe identifier, whose format is described in Probe identifiers.
`included`	A `TRUE` or `FALSE` flag specifying whether the probe is included in the filtered counts matrix output or excluded by the probe filter. See `filter-probes` option of cellranger multi. All probes of a gene must be marked `TRUE` in the `included` column for that gene to be included.
`region`	Present only in the v1.0.1 probe set reference CSV. The gene boundary targeted by the probe. Accepted values are `spliced` or `unspliced`.

The file also contains a number of required metadata fields in the header in the format #key=value:

Metadata Field	Description
`panel_name`	The name of the probe set.
`panel_type`	Always `predesigned` for predesigned probe sets.
`reference_genome`	The reference genome build used for probe design.
`reference_version`	The version of the Cell Ranger reference transcriptome used for probe design.
`probe_set_file_format`	The version of the probe set file format specification that this file conforms to.
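As an illustration of the layout described above, the following Python sketch separates the #key=value header lines from the CSV body. It is not Cell Ranger's own parser; the function name `read_probe_set` is hypothetical, and the example text is copied from the v1.0.1 snippet above.

```python
import csv
import io

EXAMPLE = """\
#probe_set_file_format=2.0
#panel_name=Chromium Human Transcriptome Probe Set v1.0.1
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included,region
ENSG00000000003,GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC,ENSG00000000003|TSPAN6|8eab823,TRUE,spliced
ENSG00000000003,TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG,ENSG00000000003|TSPAN6|9d7fe51,TRUE,unspliced
"""


def read_probe_set(handle):
    """Return (metadata dict, list of row dicts) from a probe set reference CSV."""
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            # Header lines look like "#key=value".
            key, _, value = line[1:].rstrip("\r\n").partition("=")
            metadata[key] = value
        elif line.strip():
            data_lines.append(line)
    # The first non-comment line holds the column names.
    rows = list(csv.DictReader(data_lines))
    return metadata, rows


metadata, rows = read_probe_set(io.StringIO(EXAMPLE))
print(metadata["probe_set_file_format"], len(rows))
for row in rows:
    # `region` is absent from v1.0 files, so fall back to a placeholder.
    print(row["probe_id"], row["included"] == "TRUE", row.get("region", "n/a"))
```

Reading `region` with `.get()` lets the same function handle both v1.0 and v1.0.1 files.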

Probe off-target activity CSV file

This CSV file lists probes with predicted off-target activity identified by alignment to the reference transcriptome.

The following snippet is an example of a probe off-target activity CSV file:


probe_id,off_target_genes
ENSG00000004478|FKBP4|ed8be23,ENSG00000235256|FKBP4P7;ENSG00000251463|FKBP4P1;ENSG00000268234|FKBP4P6;ENSG00000269692|FKBP4P2;ENSG00000276457|FKBP4P8
ENSG00000005302|MSL3|46ea040,ENSG00000224287|MSL3P1;ENSG00000239254|AC009220.2
ENSG00000006015|REX1BD|8367ca6,ENSG00000130766|SESN2
ENSG00000011009|LYPLA2|f905390,ENSG00000228285|LYPLA2P1;ENSG00000236604|LYPLA2P3;ENSG00000269153|LYPLA2P2
ENSG00000029363|BCLAF1|4400f06,ENSG00000248966|BCLAF1P1
ENSG00000048545|GUCA1A|42f457b,ENSG00000287363|AL096814.2
ENSG00000051596|THOC3|2d7334d,ENSG00000170089|AC106795.1
ENSG00000055955|ITIH4|e3e9693,ENSG00000243696|AC006254.1
[ ... ]

The columns for this file are:

Column Name	Description
`probe_id`	The ID of the probe with predicted off-target activity.
`off_target_genes`	A semicolon-separated list of predicted off-target genes. For each off-target gene, the Ensembl gene ID and gene symbol are separated by a vertical bar.
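The nested delimiters in this file (commas between columns, semicolons between genes, a vertical bar between gene ID and symbol) can be unpacked as in the sketch below. The function name `read_off_targets` is hypothetical, and the example rows are taken from the snippet above.

```python
import csv
import io

EXAMPLE = """\
probe_id,off_target_genes
ENSG00000005302|MSL3|46ea040,ENSG00000224287|MSL3P1;ENSG00000239254|AC009220.2
ENSG00000006015|REX1BD|8367ca6,ENSG00000130766|SESN2
"""


def read_off_targets(handle):
    """Map each probe ID to a list of (gene_id, gene_symbol) off-target pairs."""
    off_targets = {}
    for row in csv.DictReader(handle):
        pairs = []
        for entry in row["off_target_genes"].split(";"):
            # Each entry is "ensembl_gene_id|gene_symbol".
            gene_id, _, symbol = entry.partition("|")
            pairs.append((gene_id, symbol))
        off_targets[row["probe_id"]] = pairs
    return off_targets


off_targets = read_off_targets(io.StringIO(EXAMPLE))
print(off_targets["ENSG00000005302|MSL3|46ea040"])
```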

Probe BED file

A BED12-formatted file that contains the sequences and genomic coordinates of the probes. This file may be used to visualize the probe locations with genome browsers like IGV (Integrative Genomics Viewer) and the UCSC Genome Browser, or to intersect the probe locations with other genomic features of interest using tools like bedtools.

The following snippet is from an example BED12 file:


chr1	69485	69535	ENSG00000186092|OR4F5|118842c	0	-	69485	69535	0	1	50	0
chr1	926007	930198	ENSG00000187634|SAMD11|ce0f9f1	0	-	926007	930198	0	2	6,44	0,4147
chr1	930204	930254	ENSG00000187634|SAMD11|03fbb84	0	-	930204	930254	0	1	50	0
chr1	942810	942860	ENSG00000187634|SAMD11|fbf73f3	0	-	942810	942855	0	1	50	0
chr1	958990	959040	ENSG00000188976|NOC2L|e191323	0	+	958990	959040	0	1	50	0
chr1	952099	952421	ENSG00000188976|NOC2L|95b770b	0	+	952099	952421	0	2	40,10	0,312
chr1	944599	944649	ENSG00000188976|NOC2L|199f6ec	0	+	0	0	0	1	50	0
chr1	962784	962834	ENSG00000187961|KLHL17|f5213de	0	-	962784	962834	0	1	50	0
chr1	963238	963371	ENSG00000187961|KLHL17|82b4408	0	-	963238	963371	0	2	15,35	0,98
chr1	963416	963466	ENSG00000187961|KLHL17|f9ac954	0	-	963416	963466	0	1	50	0
chr1	966584	966723	ENSG00000187583|PLEKHN1|665a14d	0	-	966584	966723	0	2	30,20	0,119
[...]

The columns of BED12 files we provide are as follows (adapted from UCSC Genome Browser documentation):

Column Name	Description
`chromosome`	Chromosome of the target gene.
`chromStart`	0-based start coordinate of the targeted sequence on the chromosome.
`chromEnd`	0-based non-inclusive end coordinate on the chromosome.
`name`	Probe ID as described above.
`score`	Set to `0` for all entries.
`strand`	`+` or `-` to indicate the strand of the targeted gene.
`thickStart`	The starting position at which the feature is drawn as a thick line in browsers (matches display of the corresponding transcript region).
`thickEnd`	The ending position at which the feature is drawn as a thick line in browsers (matches display of the corresponding transcript region).
`itemRgb`	Set to `0` for all entries.
`blockCount`	The number of blocks (continuous intervals).
`blockSizes`	Comma-separated list of the block sizes, contains blockCount entries.
`blockStarts`	Comma-separated list of block starts relative to `chromStart` column, contains blockCount entries.

The BED12 format was chosen because it allows probes that span splice junctions to be conveniently represented on a single line and allows genome browsers to visualize links between regions of probes that are discontinuous in genomic space. Browsers such as UCSC Genome Browser or IGV will render BED12 files appropriately, similar to how transcripts in the genome are displayed.

This format is also well supported by command-line tools. For example, bedtools provides a -split command-line flag for some subcommands so that the individual blocks within each line of a BED12 file are treated independently. This is useful when calculating intersections, where you may be interested in overlaps with the regions covered by the probes themselves rather than with the entire genomic interval that the probe coordinates span, including intronic regions. bedtools also provides the subcommand bed12tobed6 for converting BED12 files to BED6 format; in the resulting file, a probe that spans one or more splice junctions appears on multiple lines.
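To make the block columns concrete, the sketch below computes the genomic interval of each block from `chromStart`, `blockSizes`, and `blockStarts`. It illustrates the coordinate arithmetic only and is not how bedtools is implemented; the function name `bed12_blocks` is hypothetical, and the input line is the SAMD11 junction-spanning probe from the snippet above.

```python
def bed12_blocks(line: str):
    """Return one (chrom, start, end, name) tuple per block of a BED12 line."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, chrom_start, name = fields[0], int(fields[1]), fields[3]
    block_count = int(fields[9])
    # Skip empty items so that a trailing comma is tolerated.
    sizes = [int(x) for x in fields[10].split(",") if x.strip()]
    starts = [int(x) for x in fields[11].split(",") if x.strip()]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError(f"blockCount does not match block lists for {name}")
    # blockStarts are offsets relative to chromStart; ends are non-inclusive.
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size, name)
        for offset, size in zip(starts, sizes)
    ]


line = "\t".join([
    "chr1", "926007", "930198", "ENSG00000187634|SAMD11|ce0f9f1",
    "0", "-", "926007", "930198", "0", "2", "6,44", "0,4147",
])
for block in bed12_blocks(line):
    print(block)
```

For this probe, the two blocks cover 926007-926013 and 930154-930198, so the 6 bp and 44 bp pieces together make up the 50 bp probe, and the last block ends at `chromEnd`.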

Probe set metadata TSV file

This TSV file lists additional metadata, including gene name and description, for the genes targeted by probes. The file contains all the columns from the probe set reference CSV file, as well as information about probe coverage for targeted genes and the transcript IDs targeted by each probe. This file does not list the DEPRECATED probes.

The following is a snippet from the human probe metadata TSV file:


probe_id	gene_id	gene_name	gene_description	probe_seq	included	gene_total_coverage_rounds	coverage_round	transcript_id_set	region
ENSG00000000003|TSPAN6|8eab823	ENSG00000000003	TSPAN6	tetraspanin 6	GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC	TRUE	3	1	ENST00000373020;ENST00000612152;ENST00000614008	spliced
ENSG00000000003|TSPAN6|9d7fe51	ENSG00000000003	TSPAN6	tetraspanin 6	TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG	TRUE	3	2	ENST00000373020;ENST00000612152;ENST00000614008	unspliced
[...]

The columns of this file in order are:

Column name	Description
`probe_id`	Probe ID, same as in probe set file
`gene_id`	Gene ID from the probe ID, same as in probe set file
`gene_name`	Gene name from the probe ID (e.g. ACTA1), same as in probe set file
`gene_description`	Long-form gene description from Ensembl (e.g. actin alpha 1, skeletal muscle)
`probe_seq`	Templated portion of probe pair sequence, same as in probe set file
`included`	Included column from probe set file
`gene_total_coverage_rounds`	The total number of coverage rounds that are present for this gene (fold-coverage of all transcripts for that gene within the panel)
`coverage_round`	1, 2, or 3 – The round of coverage to which the probe belongs. Counts from probes belonging to the same coverage round must be added together to get the full gene-level count for that round of coverage. `gene_total_coverage_rounds` is the max of this column within the gene to which this probe is designed.
`transcript_id_set`	Semicolon-separated list of the set of transcripts that this particular probe was designed to cover. There may be transcripts outside of GENCODE basic that are covered but not listed here. Ensembl transcript IDs are used.
`region`	Corresponds to the `region` column of the probe set reference CSV (present only in the v1.0.1 reference CSV). The gene boundary targeted by the probe. Accepted values are `spliced` or `unspliced`.
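The `coverage_round` description says that counts from probes in the same round are added together to give the gene-level count for that round. A minimal sketch of that grouping follows. The two metadata rows are copied from the snippet above, but the per-probe counts in `PROBE_COUNTS` are invented purely for illustration, and the function name `counts_per_round` is hypothetical.

```python
import csv
import io
from collections import defaultdict

HEADER = ["probe_id", "gene_id", "gene_name", "gene_description", "probe_seq",
          "included", "gene_total_coverage_rounds", "coverage_round",
          "transcript_id_set", "region"]
ROWS = [
    ["ENSG00000000003|TSPAN6|8eab823", "ENSG00000000003", "TSPAN6", "tetraspanin 6",
     "GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC", "TRUE", "3", "1",
     "ENST00000373020;ENST00000612152;ENST00000614008", "spliced"],
    ["ENSG00000000003|TSPAN6|9d7fe51", "ENSG00000000003", "TSPAN6", "tetraspanin 6",
     "TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG", "TRUE", "3", "2",
     "ENST00000373020;ENST00000612152;ENST00000614008", "unspliced"],
]
TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"

# Hypothetical per-probe counts, made up for this example only.
PROBE_COUNTS = {
    "ENSG00000000003|TSPAN6|8eab823": 12,
    "ENSG00000000003|TSPAN6|9d7fe51": 7,
}


def counts_per_round(tsv_handle, probe_counts):
    """Sum probe counts into {(gene_id, coverage_round): total}."""
    totals = defaultdict(int)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        key = (row["gene_id"], int(row["coverage_round"]))
        # Probes without a count contribute zero.
        totals[key] += probe_counts.get(row["probe_id"], 0)
    return dict(totals)


totals = counts_per_round(io.StringIO(TSV), PROBE_COUNTS)
print(totals)
```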

Probe Barcode sequence TXT file

The Probe Barcode sequences are included as a text file in the Cell Ranger software tarball (e.g., cellranger-7.0.0/lib/python/cellranger/barcodes/translation/probe-barcodes-fixed-rna-profiling.txt) and on the Cell Ranger downloads page.

Each Probe Barcode ID (e.g., BC001) has a total of eight sequences. To ensure balanced base composition during sequencing, each ID has a distinct mixture of four sequences (first four sequences listed in the text file for each ID). The remaining four sequences for each ID are included to account for potential base deletion in bases 1-68 on the R2 read during sequencing.

The first column is the actual sequence of the Probe Barcode, the second column is the sequence after translation in Cell Ranger analysis (only applies to multiplex workflows), and the third column is the Probe Barcode ID. The following snippet shows the eight sequences associated with BC001 and with BC002:


ACTTTAGG        ACTTTAGG        BC001
CTTTAGGC        ACTTTAGG        BC001
CGAGGGTA        ACTTTAGG        BC001
GAGGGTAC        ACTTTAGG        BC001
GACACTAC        ACTTTAGG        BC001
ACACTACC        ACTTTAGG        BC001
TTGCACCT        ACTTTAGG        BC001
TGCACCTC        ACTTTAGG        BC001
AACGGGAA        AACGGGAA        BC002
ACGGGAAC        AACGGGAA        BC002
CGAATTGC        AACGGGAA        BC002
GAATTGCC        AACGGGAA        BC002
GTTCCATT        AACGGGAA        BC002
TTCCATTC        AACGGGAA        BC002
TCGTACCG        AACGGGAA        BC002
CGTACCGC        AACGGGAA        BC002
...
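The three-column layout can be loaded into a lookup table from observed sequence to translated sequence and Probe Barcode ID, as sketched below. The function name `read_probe_barcodes` is hypothetical. Splitting on any run of whitespace is an assumption made so that the sketch works whether the columns are separated by tabs or by spaces. The example uses a subset of the BC001 and BC002 lines shown above.

```python
import io
from collections import Counter

EXAMPLE = """\
ACTTTAGG        ACTTTAGG        BC001
CTTTAGGC        ACTTTAGG        BC001
CGAGGGTA        ACTTTAGG        BC001
AACGGGAA        AACGGGAA        BC002
TCGTACCG        AACGGGAA        BC002
"""


def read_probe_barcodes(handle):
    """Map each observed sequence to (translated sequence, Probe Barcode ID)."""
    translation = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or elided lines
        observed, translated, barcode_id = fields
        translation[observed] = (translated, barcode_id)
    return translation


translation = read_probe_barcodes(io.StringIO(EXAMPLE))
print(translation["CTTTAGGC"])

# Number of sequences seen per Probe Barcode ID in this subset.
per_id = Counter(barcode_id for _, barcode_id in translation.values())
print(dict(per_id))
```

On the complete file, the same per-ID tally could be used to check that each Probe Barcode ID lists eight sequences.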