# Descriptions of Probe Set Reference CSV and Supporting Files

The Probe Sets Overview includes the following files:

| File | Description |
| --- | --- |
| Probe set reference CSV file | This CSV file is a required input for Space Ranger to enable analysis of Visium FFPE data. It specifies the probe sequences used for probe alignment. |
| Probe set BED file | A BED12 file that contains the sequences and genomic coordinates of the probes. This file can be used to visualize the probe locations in a genome browser and to intersect probe locations with other data sources. |
| Probe off-target activity CSV | A CSV file that lists probes with predicted off-target activity, which are excluded from analysis by default. |

## Probe Identifiers

Files containing information about individual probes have a column corresponding to the probe identifier (ID) that uniquely identifies each probe. Probe IDs take the following format:

gene_id|gene_name|probe_sequence_hash


For example, the probe for the gene TSPAN6 in the human whole transcriptome probe set, which has the Ensembl gene ID ENSG00000000003 in the GRCh38-2020-A reference, has the probe ID ENSG00000000003|TSPAN6|41ef80c.

A small number of probes whose ID includes the prefix DEPRECATED are excluded from analysis by default.
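The ID format above can be split into its three components with a few lines of Python. This is a minimal sketch, not part of Space Ranger: the names `ProbeId`, `parse_probe_id`, and `is_deprecated` are hypothetical, and the check assumes that the DEPRECATED marker sits at the very start of the ID string.

```python
from typing import NamedTuple


class ProbeId(NamedTuple):
    """The three '|'-separated components of a probe ID."""
    gene_id: str
    gene_name: str
    sequence_hash: str


def parse_probe_id(probe_id: str) -> ProbeId:
    """Split an ID of the form gene_id|gene_name|probe_sequence_hash."""
    parts = probe_id.split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 '|'-separated fields, got {len(parts)}: {probe_id!r}"
        )
    return ProbeId(*parts)


def is_deprecated(probe_id: str) -> bool:
    # Assumption: the DEPRECATED prefix appears at the start of the full ID.
    return probe_id.startswith("DEPRECATED")


probe = parse_probe_id("ENSG00000000003|TSPAN6|41ef80c")
print(probe.gene_id, probe.gene_name, probe.sequence_hash)
```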

### Probe Set Reference CSV File

This CSV file is a required input for Space Ranger to enable analysis of Visium FFPE data. It specifies the sequences used as a reference for probe alignment and the gene ID associated with each probe. For details, see the description of the --probe-set argument in the spaceranger count documentation.

The following snippet is an example from a probe set reference CSV file:

#probe_set_file_format=1.0
#panel_name=Visium Human Transcriptome Probe Set
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included
ENSG00000000003,ATCTT[...]TGCTT,ENSG00000000003|TSPAN6|41ef80c,TRUE
ENSG00000000005,ATGAC[...]AGTAA,ENSG00000000005|TNMD|f11e5fc,TRUE
ENSG00000000419,TTGTA[...]TTCCT,ENSG00000000419|DPM1|73ef065,TRUE
ENSG00000000457,CTTGA[...]GGAAT,ENSG00000000457|SCYL3|e327340,TRUE
[ ... ]


The columns of this file are:

| Column Name | Description |
| --- | --- |
| gene_id | The Ensembl gene identifier targeted by the probe. |
| probe_seq | The nucleotide sequence of the probe, which is complementary to the transcript sequence. |
| probe_id | The probe identifier, whose format is described above. |
| included | A TRUE/FALSE flag specifying whether the probe is included in the filtered counts matrix output or excluded by the probe filter. See the --no-probe-filter command-line argument of spaceranger count. All probes of a gene must be marked TRUE in the included column for that gene to be included. |

The file also contains a number of required metadata fields in the header in the format #key=value:

| Metadata Field | Description |
| --- | --- |
| panel_name | The name of the probe set. |
| panel_type | Always predesigned for predesigned probe sets. |
| reference_genome | The reference genome build used for probe design. |
| reference_version | The version of the Space Ranger reference transcriptome used for probe design. |
| probe_set_file_format | The version of the probe set file format specification that this file conforms to. |
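As an illustration of the layout described above, the following Python sketch reads the `#key=value` header lines into a dictionary, parses the remaining lines as CSV, and applies the rule that a gene is included only if all of its probes are marked TRUE. The function names `read_probe_set` and `included_genes` are hypothetical, and the code is not how Space Ranger itself reads the file. The sample text is copied from the snippet above, with the elided sequences left as-is.

```python
import csv
import io

SNIPPET = """\
#probe_set_file_format=1.0
#panel_name=Visium Human Transcriptome Probe Set
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
gene_id,probe_seq,probe_id,included
ENSG00000000003,ATCTT[...]TGCTT,ENSG00000000003|TSPAN6|41ef80c,TRUE
ENSG00000000005,ATGAC[...]AGTAA,ENSG00000000005|TNMD|f11e5fc,TRUE
ENSG00000000419,TTGTA[...]TTCCT,ENSG00000000419|DPM1|73ef065,TRUE
ENSG00000000457,CTTGA[...]GGAAT,ENSG00000000457|SCYL3|e327340,TRUE
"""


def read_probe_set(handle):
    """Return (metadata dict, list of row dicts) from a probe set CSV."""
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            # Header metadata lines have the form #key=value.
            key, _, value = line[1:].rstrip("\r\n").partition("=")
            metadata[key] = value
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return metadata, rows


def included_genes(rows):
    """A gene counts as included only if every one of its probes is TRUE."""
    status = {}
    for row in rows:
        gene = row["gene_id"]
        flag = row["included"].strip().upper() == "TRUE"
        status[gene] = status.get(gene, True) and flag
    return {gene for gene, ok in status.items() if ok}


metadata, rows = read_probe_set(io.StringIO(SNIPPET))
print(metadata["panel_name"], len(rows), sorted(included_genes(rows)))
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.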

### Probe Off-target Activity CSV File

This CSV file lists probes with predicted off-target activity identified by alignment to the reference transcriptome.

The following snippet is an example of a probe off-target activity CSV file:

probe_id,off_target_genes
ENSG00000004455|AK2|9f34385,ENSG00000185839|AL035411.1;ENSG00000242272|AK2P2
ENSG00000004478|FKBP4|ed8be23,ENSG00000251463|FKBP4P1
ENSG00000005022|SLC25A5|e0d84d7,ENSG00000225347|SLC25A5P8;ENSG00000213673|SLC25A5P3;ENSG00000226421|SLC25A5P5;ENSG00000235064|SLC25A5P2;ENSG00000213332|SLC25A5P6;ENSG00000251078|SLC25A5P9
ENSG00000005075|POLR2J|e3cf54f,ENSG00000228049|POLR2J2;ENSG00000285437|POLR2J3;ENSG00000272655|POLR2J4
ENSG00000006625|GGCT|dea3f5e,ENSG00000232943|AL050321.1
ENSG00000006756|ARSD|0fd6410,ENSG00000225117|ARSDP1
ENSG00000008128|CDK11A|2d9087f,ENSG00000248333|CDK11B
ENSG00000008324|SS18L2|cc126ff,ENSG00000232525|SS18L2P1
[ ... ]


The columns for this file are:

| Column Name | Description |
| --- | --- |
| probe_id | The ID of the probe with predicted off-target activity. |
| off_target_genes | A semicolon-separated list of predicted off-target genes. For each off-target gene, the Ensembl gene ID and gene symbol are separated by a vertical bar. |
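The nested delimiters in the off_target_genes column (semicolons between genes, a vertical bar between gene ID and symbol) can be unpacked as in the following Python sketch. The function name `read_off_targets` is hypothetical, and the sample text reuses the first two rows of the snippet above.

```python
import csv
import io

SNIPPET = """\
probe_id,off_target_genes
ENSG00000004455|AK2|9f34385,ENSG00000185839|AL035411.1;ENSG00000242272|AK2P2
ENSG00000004478|FKBP4|ed8be23,ENSG00000251463|FKBP4P1
"""


def read_off_targets(handle):
    """Map each probe ID to a list of (gene_id, gene_symbol) off-targets."""
    result = {}
    for row in csv.DictReader(handle):
        genes = []
        for entry in row["off_target_genes"].split(";"):
            # Each entry has the form ensembl_gene_id|gene_symbol.
            gene_id, _, gene_symbol = entry.partition("|")
            genes.append((gene_id, gene_symbol))
        result[row["probe_id"]] = genes
    return result


off_targets = read_off_targets(io.StringIO(SNIPPET))
for probe_id, genes in off_targets.items():
    print(probe_id, genes)
```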

### Probe BED File

This BED12-formatted file contains the sequences and genomic coordinates of the probes. It can be used to visualize the probe locations in genome browsers such as IGV (Integrative Genomics Viewer) and the UCSC Genome Browser, or to intersect the probe locations with other genomic features of interest using tools such as bedtools.

The following snippet is from an example BED12 file:

chr1    69519   69569   ENSG00000186092|OR4F5|c4da86d   0   -   69519   69569   0   1   50  0
chr1    925956  926006  ENSG00000187634|SAMD11|87d23c4  0   -   925956  926006  0   1   50  0
chr1    958972  959022  ENSG00000188976|NOC2L|6b84612   0   +   958972  959022  0   1   50  0
chr1    963955  964005  ENSG00000187961|KLHL17|da46e9a  0   -   963955  964005  0   1   50  0
chr1    970295  970345  ENSG00000187583|PLEKHN1|848db4f 0   -   970295  970345  0   1   50  0
chr1    979664  979714  ENSG00000187642|PERM1|2aaf487   0   +   979664  979714  0   1   50  0
chr1    999353  999403  ENSG00000188290|HES4|cbe069d    0   +   999353  999403  0   1   50  0
chr1    1014261 1014311 ENSG00000187608|ISG15|8b560b9   0   -   1014261 1014311 0   1   50  0
chr1    1043324 1043374 ENSG00000188157|AGRN|f06ab24    0   -   1043324 1043374 0   1   50  0
chr1    1072173 1072223 ENSG00000237330|RNF223|522e0bc  0   +   1072173 1072223 0   1   50  0


The columns of BED12 files we provide are as follows (adapted from UCSC Genome Browser documentation):

| Column Name | Description |
| --- | --- |
| chromosome | Chromosome of the target gene. |
| chromStart | 0-based start coordinate of the targeted sequence on the chromosome. |
| chromEnd | 0-based non-inclusive end coordinate on the chromosome. |
| name | Probe ID as described above. |
| score | Set to 0 for all entries. |
| strand | + or - to indicate the strand of the targeted gene. |
| thickStart | The starting position at which the feature is drawn as a thick line in browsers (matches display of the corresponding transcript region). |
| thickEnd | The ending position at which the feature is drawn as a thick line in browsers (matches display of the corresponding transcript region). |
| itemRgb | Set to 0 for all entries. |
| blockCount | The number of blocks (continuous intervals). |
| blockSizes | Comma-separated list of the block sizes, contains blockCount entries. |
| blockStarts | Comma-separated list of block starts relative to chromStart column, contains blockCount entries. |
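To show how the twelve columns fit together, the following Python sketch parses one line of the example above and converts the relative block starts into absolute genomic intervals. The names `BED12_FIELDS` and `parse_bed12_line` are hypothetical, and the code splits on any whitespace so that it accepts both tab-delimited files and the space-aligned rendering shown here.

```python
BED12_FIELDS = (
    "chromosome", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
)


def parse_bed12_line(line):
    """Parse one BED12 line into a dict with an added 'blocks' entry."""
    fields = line.split()
    if len(fields) != len(BED12_FIELDS):
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    record = dict(zip(BED12_FIELDS, fields))
    for key in ("chromStart", "chromEnd", "thickStart", "thickEnd", "blockCount"):
        record[key] = int(record[key])
    # Tolerate a trailing comma, which some BED12 writers emit.
    sizes = [int(x) for x in record["blockSizes"].split(",") if x]
    starts = [int(x) for x in record["blockStarts"].split(",") if x]
    if not (len(sizes) == len(starts) == record["blockCount"]):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # Block starts are relative to chromStart; intervals are 0-based, half-open.
    record["blocks"] = [
        (record["chromStart"] + start, record["chromStart"] + start + size)
        for start, size in zip(starts, sizes)
    ]
    return record


LINE = "chr1\t69519\t69569\tENSG00000186092|OR4F5|c4da86d\t0\t-\t69519\t69569\t0\t1\t50\t0"
record = parse_bed12_line(LINE)
print(record["name"], record["strand"], record["blocks"])
```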

The BED12 format was chosen because it allows probes that span splice junctions to be conveniently represented on a single line and allows genome browsers to visualize links between regions of probes that are discontinuous in genomic space. Browsers such as UCSC Genome Browser or IGV will render BED12 files appropriately, similar to how transcripts in the genome are displayed.

This format is also well supported by command-line tools. For example, bedtools provides a -split command-line flag for some subcommands that allows the individual blocks within each line of a BED12 file to be treated independently. This is useful when calculating intersections if you are interested in overlaps with the regions covered by the probes themselves rather than with the entire genomic interval that the probe coordinates span, which includes intronic regions. bedtools also provides the bed12tobed6 subcommand for converting BED12 files to BED6 format. In the resulting file, each probe that spans one or more splice junctions appears on multiple lines.
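The difference between intersecting with the whole span and intersecting with the individual blocks can be illustrated in plain Python. This sketch only mirrors the idea behind block-wise treatment and is not bedtools' implementation. The probe coordinates are hypothetical (a probe with two 25-base blocks separated by an intron), as are the function names `block_intervals` and `overlaps`.

```python
def block_intervals(chrom_start, block_sizes, block_starts):
    """Expand BED12 block columns into absolute half-open intervals."""
    sizes = [int(x) for x in block_sizes.split(",") if x]
    starts = [int(x) for x in block_starts.split(",") if x]
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


def overlaps(a, b):
    """True if two 0-based half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical spliced probe: chromStart=1000, chromEnd=1250, two blocks.
span = (1000, 1250)
blocks = block_intervals(1000, "25,25", "0,225")

# Hypothetical feature that lies entirely inside the intron.
feature = (1100, 1150)

span_hit = overlaps(span, feature)                     # whole-interval view
block_hit = any(overlaps(b, feature) for b in blocks)  # block-wise view

print("blocks:", blocks)
print("overlaps span:", span_hit, "| overlaps a block:", block_hit)
```

Here the feature overlaps the full genomic span of the probe but none of its blocks, which is the case that block-wise intersection is meant to exclude. Listing `blocks` one interval per line corresponds to the multi-line representation produced when a BED12 record is converted to BED6.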