Probe Set Reference CSV and Supporting Files

The Probe Sets Overview includes the following files:

File	Description
Probe set reference CSV file	This CSV file is a required input for Space Ranger to enable analysis of Visium FFPE data. It specifies the probe sequences used for probe alignment.
Probe off-target activity CSV	A CSV file that lists probes with predicted off-target activity. These probes are excluded from analysis by default.
Probe set BED file	A BED12 file that contains the sequences and genomic coordinates of the probes. This file can be used to visualize the probe locations in a genome browser and intersect probe locations with other data sources.
Probe set metadata TSV file	A TSV file that lists probes with additional information about gene name and description.

Probe identifiers

Files containing information about individual probes have a column corresponding to the probe identifier (ID) that uniquely identifies each probe. Probe IDs take the following format:


gene_id|gene_name|probe_sequence_hash

For example, the probe for the gene TSPAN6 in the human whole transcriptome probe set, which has the Ensembl gene ID ENSG00000000003 in the GRCh38-2020-A reference, has the probe ID ENSG00000000003|TSPAN6|41ef80c.

A small number of probes whose ID includes the prefix DEPRECATED are always excluded from analysis. Not all reagent lots contain these deprecated probes.
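As an illustration of the ID format above, the following Python sketch splits a probe ID into its three fields and flags deprecated probes. The helper names `parse_probe_id` and `is_deprecated` are hypothetical, and the check assumes the DEPRECATED prefix appears at the very start of the ID.

```python
from typing import NamedTuple


class ProbeId(NamedTuple):
    gene_id: str
    gene_name: str
    sequence_hash: str


def parse_probe_id(probe_id: str) -> ProbeId:
    """Split an ID of the form gene_id|gene_name|probe_sequence_hash."""
    parts = probe_id.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields in {probe_id!r}")
    return ProbeId(*parts)


def is_deprecated(probe_id: str) -> bool:
    # Assumption: the DEPRECATED prefix sits at the very start of the ID.
    return probe_id.startswith("DEPRECATED")


parsed = parse_probe_id("ENSG00000000003|TSPAN6|41ef80c")
print(parsed.gene_id, parsed.gene_name, parsed.sequence_hash)
```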

File formats for probe set downloads

Probe set reference CSV file

This CSV file is a required input for Space Ranger to enable analysis of Visium FFPE data. It specifies the sequences used as a reference for probe alignment and the gene ID associated with each probe. This file is specified using the --probe-set argument to the spaceranger count pipeline.

The following snippet is an example from a v1 probe set reference CSV file:


#probe_set_file_format=1.0
#panel_name=Visium Human Transcriptome Probe Set
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included
ENSG00000000003,ATCTT[...]TGCTT,ENSG00000000003|TSPAN6|41ef80c,TRUE
ENSG00000000005,ATGAC[...]AGTAA,ENSG00000000005|TNMD|f11e5fc,TRUE
ENSG00000000419,TTGTA[...]TTCCT,ENSG00000000419|DPM1|73ef065,TRUE
ENSG00000000457,CTTGA[...]GGAAT,ENSG00000000457|SCYL3|e327340,TRUE
[ ... ]

In the v2 probe set, many genes have 3-fold coverage, i.e., three probes per gene. The v2 probe set reference CSV also adds a column called region, whose value is either spliced or unspliced. The following snippet is an example from a v2 probe set reference CSV file:


#probe_set_file_format=2.0
#panel_name=Visium Human Transcriptome Probe Set v2.0
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included,region
ENSG00000000003,GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC,ENSG00000000003|TSPAN6|8eab823,TRUE,spliced
ENSG00000000003,TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG,ENSG00000000003|TSPAN6|9d7fe51,TRUE,unspliced
ENSG00000000003,AAAGCTGTTCTTAATCTCATGTCTGAAAACAAATCCTACGATGGCAGCGA,ENSG00000000003|TSPAN6|d2b5833,TRUE,spliced
ENSG00000000005,CGTGACGGGTCTTCTCTACTTTCACTTGAGGGACCACCCACTGTTCATTT,ENSG00000000005|TNMD|7790621,TRUE,unspliced
ENSG00000000005,GCCTCGACGGCAGTAAATACAACAATAACCTCTCTCATCCAGCATGGGAT,ENSG00000000005|TNMD|923f04b,TRUE,unspliced
[ ... ]

The columns of this file are:

Column Name	Description
`gene_id`	The Ensembl gene identifier targeted by the probe.
`probe_seq`	The nucleotide sequence of the probe, which is complementary to the transcript sequence.
`probe_id`	The probe identifier, whose format is described in Probe identifiers.
`included`	A `TRUE`/`FALSE` flag specifying whether the probe is included in the filtered counts matrix output or excluded by the probe filter. See the `--no-probe-filter` command line argument of spaceranger count. All probes of a gene must be marked `TRUE` in the `included` column for that gene to be included.
`region`	Present only in v2 probe set reference CSV. The gene boundary targeted by the probe. Acceptable values are `spliced` or `unspliced`.
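The gene-level rule for the `included` column can be sketched in Python as follows. This is only an illustration of the rule as described in the table, not Space Ranger's implementation. The function name `genes_passing_filter` and the rows in `example` (gene IDs, sequences, hashes) are made up for the demonstration.

```python
import csv
from collections import defaultdict


def genes_passing_filter(csv_text):
    """Map each gene_id to True only if every one of its probes has included=TRUE."""
    # Drop blank lines and the "#key=value" metadata lines before the column header.
    data_lines = [
        line for line in csv_text.splitlines()
        if line and not line.startswith("#")
    ]
    gene_ok = defaultdict(lambda: True)
    for row in csv.DictReader(data_lines):
        probe_included = row["included"].strip().upper() == "TRUE"
        gene_ok[row["gene_id"]] = gene_ok[row["gene_id"]] and probe_included
    return dict(gene_ok)


# Made-up rows: GENE_A has one excluded probe, GENE_B has none.
example = """\
#probe_set_file_format=2.0
gene_id,probe_seq,probe_id,included,region
GENE_A,ACGT,GENE_A|A|0000001,TRUE,spliced
GENE_A,TGCA,GENE_A|A|0000002,FALSE,unspliced
GENE_B,GGCC,GENE_B|B|0000003,TRUE,spliced
"""
print(genes_passing_filter(example))
```

With these rows, GENE_A is reported as excluded because one of its two probes is marked `FALSE`, while GENE_B is included.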

The file also contains a number of required metadata fields in the header in the format #key=value:

Metadata Field	Description
`panel_name`	The name of the probe set.
`panel_type`	Always `predesigned` for predesigned probe sets.
`reference_genome`	The reference genome build used for probe design.
`reference_version`	The version of the Space Ranger reference transcriptome used for probe design.
`probe_set_file_format`	The version of the probe set file format specification that this file conforms to.
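A minimal Python sketch for reading these header fields is shown below. It assumes that all `#key=value` lines come before the column header, as in the snippets above. The function name `read_probe_set_metadata` is hypothetical.

```python
REQUIRED_FIELDS = {
    "panel_name",
    "panel_type",
    "reference_genome",
    "reference_version",
    "probe_set_file_format",
}


def read_probe_set_metadata(lines):
    """Collect the leading '#key=value' lines of a probe set CSV into a dict."""
    metadata = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("#"):
            break  # the first non-'#' line is the column header
        key, sep, value = line[1:].partition("=")
        if sep:
            metadata[key] = value
    return metadata


# Header lines copied from the v2 snippet shown earlier in this section.
header_lines = [
    "#probe_set_file_format=2.0",
    "#panel_name=Visium Human Transcriptome Probe Set v2.0",
    "#panel_type=predesigned",
    "#reference_genome=GRCh38",
    "#reference_version=2020-A",
    "gene_id,probe_seq,probe_id,included,region",
]
metadata = read_probe_set_metadata(header_lines)
missing = REQUIRED_FIELDS - set(metadata)
print(metadata["probe_set_file_format"], sorted(missing))
```

To read a real file, pass an open file handle instead of the list, since the function only iterates over lines.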

Probe off-target activity CSV file

This CSV file lists probes with predicted off-target activity identified by alignment to the reference transcriptome.

The following snippet is an example of a probe off-target activity CSV file:


$ column -s, -t < Visium_Human_Transcriptome_Probe_Set_v2.0_GRCh38-2020-A.offtarget.csv | less -S

probe_id                             off_target_genes
ENSG00000004478|FKBP4|ed8be23        ENSG00000235256|FKBP4P7;ENSG00000251463|FKBP4P1;ENSG00000268234|FKBP4P6;ENSG00000269692|FKBP4P2;ENSG00000276457|FKBP4P8
ENSG00000005302|MSL3|46ea040         ENSG00000224287|MSL3P1;ENSG00000239254|AC009220.2
ENSG00000006015|REX1BD|8367ca6       ENSG00000130766|SESN2
ENSG00000011009|LYPLA2|f905390       ENSG00000228285|LYPLA2P1;ENSG00000236604|LYPLA2P3;ENSG00000269153|LYPLA2P2
ENSG00000029363|BCLAF1|4400f06       ENSG00000248966|BCLAF1P1
ENSG00000048545|GUCA1A|42f457b       ENSG00000287363|AL096814.2
ENSG00000051596|THOC3|2d7334d        ENSG00000170089|AC106795.1
ENSG00000055955|ITIH4|e3e9693        ENSG00000243696|AC006254.1
ENSG00000058673|ZC3H11A|0f1fc9b      ENSG00000257315|ZBED6
ENSG00000065371|ROPN1|f33b60a        ENSG00000114547|ROPN1B
ENSG00000069329|VPS35|bb6ea42        ENSG00000260809|VPS35P1
[ ... ]

The columns for this file are:

Column Name	Description
`probe_id`	The ID of the probe with predicted off-target activity.
`off_target_genes`	A semicolon-separated list of predicted off-target genes. For each off-target gene, the Ensembl gene ID and gene symbol are separated by a vertical bar.
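The nested delimiters in this file (commas between columns, semicolons between genes, a vertical bar between gene ID and symbol) can be unpacked with a short Python sketch such as the one below. The function name `parse_off_targets` is hypothetical, and the sketch assumes the raw file starts directly with the column header shown above.

```python
import csv


def parse_off_targets(csv_lines):
    """Map each probe_id to a list of (gene_id, gene_symbol) off-target pairs."""
    result = {}
    for row in csv.DictReader(csv_lines):
        pairs = []
        for entry in row["off_target_genes"].split(";"):
            gene_id, _, symbol = entry.partition("|")
            pairs.append((gene_id, symbol))
        result[row["probe_id"]] = pairs
    return result


# Two rows taken from the snippet above, written in their raw comma-separated form.
raw_lines = [
    "probe_id,off_target_genes",
    "ENSG00000005302|MSL3|46ea040,ENSG00000224287|MSL3P1;ENSG00000239254|AC009220.2",
    "ENSG00000006015|REX1BD|8367ca6,ENSG00000130766|SESN2",
]
off_targets = parse_off_targets(raw_lines)
print(off_targets["ENSG00000005302|MSL3|46ea040"])
```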

Probe set BED file

A BED12-formatted file that contains the sequences and genomic coordinates of the probes. This file may be used to visualize the probe locations with genome browsers like IGV (Integrative Genomics Viewer) and the UCSC Genome Browser, or to intersect the probe locations with other genomic features of interest using tools like bedtools.

The following snippet is from an example BED12 file:


chr1	69519	69569	ENSG00000186092|OR4F5|c4da86d	0	-	69519	69569	0	1	50	0
chr1	925956	926006	ENSG00000187634|SAMD11|87d23c4	0	-	925956	926006	0	1	50	0
chr1	958972	959022	ENSG00000188976|NOC2L|6b84612	0	+	958972	959022	0	1	50	0
chr1	963955	964005	ENSG00000187961|KLHL17|da46e9a	0	-	963955	964005	0	1	50	0
chr1	970295	970345	ENSG00000187583|PLEKHN1|848db4f	0	-	970295	970345	0	1	50	0
chr1	979664	979714	ENSG00000187642|PERM1|2aaf487	0	+	979664	979714	0	1	50	0
chr1	999353	999403	ENSG00000188290|HES4|cbe069d	0	+	999353	999403	0	1	50	0
chr1	1014261	1014311	ENSG00000187608|ISG15|8b560b9	0	-	1014261	1014311	0	1	50	0
chr1	1043324	1043374	ENSG00000188157|AGRN|f06ab24	0	-	1043324	1043374	0	1	50	0
chr1	1072173	1072223	ENSG00000237330|RNF223|522e0bc	0	+	1072173	1072223	0	1	50	0

The columns of BED12 files we provide are as follows (adapted from UCSC Genome Browser documentation):

Column Name	Description
`chromosome`	Chromosome of the target gene.
`chromStart`	0-based start coordinate of the targeted sequence on the chromosome.
`chromEnd`	0-based non-inclusive end coordinate on the chromosome.
`name`	Probe ID, as described in Probe identifiers.
`score`	Set to `0` for all entries.
`strand`	`+` or `-` to indicate the strand of the targeted gene.
`thickStart`	The starting position at which the feature is drawn as a thick line in browsers (matches display of the corresponding transcript region).
`thickEnd`	The ending position at which the feature is drawn as a thick line in browsers (matches display of the corresponding transcript region).
`itemRgb`	Set to `0` for all entries.
`blockCount`	The number of blocks (continuous intervals).
`blockSizes`	Comma-separated list of the block sizes, contains blockCount entries.
`blockStarts`	Comma-separated list of block starts relative to `chromStart` column, contains blockCount entries.

The BED12 format was chosen because it allows probes that span splice junctions to be conveniently represented on a single line and allows genome browsers to visualize links between regions of probes that are discontinuous in genomic space. Browsers such as UCSC Genome Browser or IGV will render BED12 files appropriately, similar to how transcripts in the genome are displayed.

This format is also well-supported by command-line tools. For example, bedtools provides a -split command-line flag for some subcommands to allow the individual blocks within each line of a BED12 file to be treated independently as needed. This can be useful for calculating intersections, for example, where you may be interested in intersections with the regions covered by the probes themselves rather than intersections with the entire genomic interval the probe coordinates span including intronic regions. bedtools also provides the subcommand bed12tobed6 for conversion of BED12 files to BED6 format -- in the resulting file each probe would appear on multiple lines when spanning one or more splice junctions.
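To make the block columns concrete, the following Python sketch expands one BED12 line into one genomic interval per block, which is roughly the per-block view that bedtools bed12tobed6 produces. It is an illustration only, not a replacement for bedtools. The function name `bed12_blocks` is hypothetical, and the two-block line in the example is made up to show a probe that spans a splice junction.

```python
def bed12_blocks(line):
    """Return one (chrom, start, end, name, strand) tuple per block of a BED12 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
    chrom, name, strand = fields[0], fields[3], fields[5]
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # BED allows a trailing comma in the block lists, so skip empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # Block starts are relative to chromStart; ends are non-inclusive.
    return [
        (chrom, chrom_start + rel, chrom_start + rel + size, name, strand)
        for rel, size in zip(starts, sizes)
    ]


# A single-block line from the snippet above.
single = "chr1\t958972\t959022\tENSG00000188976|NOC2L|6b84612\t0\t+\t958972\t959022\t0\t1\t50\t0"
print(bed12_blocks(single))

# A made-up two-block line: two 25 bp blocks separated by a 100 bp gap.
split = "chr1\t1000\t1150\tGENE|NAME|0000000\t0\t+\t1000\t1150\t0\t2\t25,25\t0,125"
print(bed12_blocks(split))
```

For the made-up two-block line, the result is two intervals, 1000 to 1025 and 1125 to 1150, which excludes the gap between them.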

Probe set metadata TSV file

This TSV file lists the gene name and description for each gene targeted by a probe. It contains all the columns from the Probe set reference CSV file plus these two additional columns. This file does not list the DEPRECATED probes.

The following command and output snippet are from the v1 human probe metadata TSV file:


$ column -t -s $'\t' Visium_Human_Transcriptome_Probe_Set_v1.0_GRCh38-2020-A.probe_metadata.tsv | less --chop-long-lines

probe_id                                gene_id          gene_name       gene_description                                                                                                            probe_seq                                           included
ENSG00000000003|TSPAN6|41ef80c          ENSG00000000003  TSPAN6          tetraspanin 6                                                                                                               ATCTTGTCTACTGCATGGCTTCTATAATCTCCTGTAGAGTTATACTGCTT  TRUE
ENSG00000000005|TNMD|f11e5fc            ENSG00000000005  TNMD            tenomodulin                                                                                                                 ATGACTCGTCCTCCTTGGTAGCAGTATGGATATGGGTAGTAGCCTAGTAA  TRUE
ENSG00000000419|DPM1|73ef065            ENSG00000000419  DPM1            dolichyl-phosphate mannosyltransferase subunit 1, catalytic                                                                 TTGTAGCGAGTTCCAGAGACAATATCAAAATTACCCTCCTTTTGCTTCCT  TRUE
ENSG00000000457|SCYL3|e327340           ENSG00000000457  SCYL3           SCY1 like pseudokinase 3                                                                                                    CTTGATTTCCAAGGCATAGACTCTTCAGTGAGTGAAAGCAAAGCAGGAAT  TRUE

The following command and output snippet are from the v2 human probe metadata TSV file, which includes some additional columns:


$ column -t -s $'\t' Visium_Human_Transcriptome_Probe_Set_v2.0_GRCh38-2020-A.probe_metadata.tsv | less --chop-long-lines

probe_id                             gene_id          gene_name    gene_description                                                                                                            probe_seq                                           included  gene_total_coverage_rounds  coverage_round  transcript_id_set
ENSG00000000003|TSPAN6|8eab823       ENSG00000000003  TSPAN6       tetraspanin 6                                                                                                               GGTGACACCACAACAATGCAACGTATTTTGGATCTTGTCTACTGCATGGC  TRUE      3                           1               ENST00000373020;ENST00000612152;ENST00000614008
ENSG00000000003|TSPAN6|9d7fe51       ENSG00000000003  TSPAN6       tetraspanin 6                                                                                                               TCTGCATCTCTCTGTGGAGTACAATCTTCAAGTTTACAGCAACTCTTAGG  TRUE      3                           2               ENST00000373020;ENST00000612152;ENST00000614008
ENSG00000000003|TSPAN6|d2b5833       ENSG00000000003  TSPAN6       tetraspanin 6                                                                                                               AAAGCTGTTCTTAATCTCATGTCTGAAAACAAATCCTACGATGGCAGCGA  TRUE      3                           3               ENST00000373020;ENST00000612152;ENST00000614008
ENSG00000000005|TNMD|7790621         ENSG00000000005  TNMD         tenomodulin                                                                                                                 CGTGACGGGTCTTCTCTACTTTCACTTGAGGGACCACCCACTGTTCATTT  TRUE      3                           1               ENST00000373031
ENSG00000000005|TNMD|ab5ef5a         ENSG00000000005  TNMD         tenomodulin                                                                                                                 AAGGCATGATGACACGACAGATGACTCGTCCTCCTTGGTAGCAGTATGGA  TRUE      3                           2               ENST00000373031

The columns of this file in order are:

Column Name	Description
`probe_id`	Probe identifier, as included in Probe set reference CSV file. The format is described in Probe identifiers.
`gene_id`	The Ensembl gene identifier targeted by the probe, same as in Probe set reference CSV file.
`gene_name`	The official HGNC gene symbol targeted by the probe.
`gene_description`	The official HGNC gene full name targeted by the probe.
`probe_seq`	The nucleotide sequence of the probe, which is complementary to the transcript sequence, same as in Probe set reference CSV file.
`included`	A `TRUE`/`FALSE` flag specifying whether the probe is included in the filtered counts matrix output or excluded by the probe filter; same as in the Probe set reference CSV file. See the `--no-probe-filter` command line argument of spaceranger count. All probes of a gene must be marked `TRUE` in the `included` column for that gene to be included.
`region`	Present only in v2 probe sets. Same as the `region` column in the Probe set reference CSV file; values are either `spliced` or `unspliced`.
`gene_total_coverage_rounds`	Present only in the v2 probe set metadata TSV. The total number of coverage rounds that are present for this gene (fold-coverage of all transcripts for that gene within the panel).
`coverage_round`	Present only in the v2 probe set metadata TSV. 1, 2, or 3: the round of coverage to which the probe belongs. Counts from probes belonging to the same coverage round must be added together to get the full gene-level count for that round of coverage. `gene_total_coverage_rounds` is the maximum of this column across the probes designed for the same gene.
`transcript_id_set`	Present only in the v2 probe set metadata TSV. Semicolon-separated list of the transcripts that this particular probe was designed to cover. There may be transcripts outside of GENCODE basic that are covered but not listed here. Ensembl transcript IDs are used.
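The `coverage_round` rule, under which counts from probes in the same round are added together, can be sketched in Python as follows. The function name `counts_per_coverage_round`, the per-probe counts, and the `GENE_X` probes are all made up for illustration. Only the TSPAN6 probe IDs and their rounds come from the v2 snippet above.

```python
from collections import defaultdict


def counts_per_coverage_round(probe_counts, probe_rounds):
    """Sum probe counts into (gene_id, coverage_round) totals.

    probe_counts: probe_id -> count
    probe_rounds: probe_id -> (gene_id, coverage_round), as read from the metadata TSV
    """
    totals = defaultdict(int)
    for probe_id, count in probe_counts.items():
        gene_id, coverage_round = probe_rounds[probe_id]
        totals[(gene_id, coverage_round)] += count
    return dict(totals)


probe_rounds = {
    "ENSG00000000003|TSPAN6|8eab823": ("ENSG00000000003", 1),
    "ENSG00000000003|TSPAN6|9d7fe51": ("ENSG00000000003", 2),
    "ENSG00000000003|TSPAN6|d2b5833": ("ENSG00000000003", 3),
    # Hypothetical gene with two probes in the same coverage round.
    "GENE_X|X|0000001": ("GENE_X", 1),
    "GENE_X|X|0000002": ("GENE_X", 1),
}
# Made-up counts for a single spot.
probe_counts = {
    "ENSG00000000003|TSPAN6|8eab823": 4,
    "ENSG00000000003|TSPAN6|9d7fe51": 1,
    "ENSG00000000003|TSPAN6|d2b5833": 2,
    "GENE_X|X|0000001": 3,
    "GENE_X|X|0000002": 5,
}
totals = counts_per_coverage_round(probe_counts, probe_rounds)
print(totals[("GENE_X", 1)])
```

In this example the two hypothetical GENE_X probes share round 1, so their counts are summed into a single total, while each TSPAN6 probe contributes to its own round.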