Software  ›   pipelines

# Targeted Spatial Gene Expression Algorithms Overview

## Targeted Subsampling

For Targeted Spatial Gene Expression, subsampling is performed after barcode assignment and prior to alignment. This subsampling is enabled only for sequencing libraries with high depth, exceeding 15,000 reads under tissue per spot, in which spurious molecules can be observed in a small fraction of reads. The desired depth limit can be specified using the argument --rps-limit=N or disabled using --rps-limit=0, although this is not recommended. The unanalyzed reads are present in the BAM file, unmapped and indicated using the xf:i:32 tag.

## Design of Bait Sequences for Targeted Spatial Gene Expression

### Bait Design Overview

In order to specifically recover molecules from targeted genes of interest in gene expression libraries, a set of bait oligonucleotides are designed with sequence complementarity for each gene in a targeted panel. Baits designed for the Targeted Spatial Gene Expression assay are 120 base pair (bp) biotinylated oligonucleotides that are used within a hybridization-capture workflow.

Targeted Spatial Gene Expression baits were designed with the following broad design principles:

• Compatibility: Targeted baits can be applied to a variety of gene expression technologies, including Visium Spatial Gene Expression libraries and both 3’ and 5’ Chromium single cell RNA-seq libraries.
• Comprehensiveness: Baits are designed to cover the full set of annotated transcripts for each gene in the target panel, without designing redundant baits across overlapping transcript sequences. This strategy also confers added robustness by enabling capture of molecules derived from unexpected or poorly-annotated isoforms of target genes.
• Specificity: Baits are chosen to avoid unwanted complementarity to repetitive elements or other problematic genomic sequences that may compromise assay performance.

### Which Transcripts are Targeted?

Baits are tiled across the full length of all transcripts annotated in the GRCh38-2020-A reference. The transcript annotations in this reference are based on the GENCODE comprehensive set. See release notes for documentation on this new reference.

### Tiling Approach

For each transcript within a gene, bait sequences are designed starting at the 5’ and 3’ ends and progresses inwards until the center of the transcript is reached. Baits are typically 120bp in length (see the Optimizing Specificity section for more details on exceptions) and are aimed to tile at 1x coverage. If the transcript length is not a multiple of 120bp, a small coverage gap appears in the center of the transcript (away from the annotated 3’ or 5’ ends). In order to prevent the unnecessary selection of redundant baits from regions of identical sequence shared across transcripts from the same gene, a De Bruijn graph data structure is used. By querying the De Bruijn graph constructed using all 120bp subsequences derived from each transcript, the algorithm ensures that any potential new bait sequence is rejected if it would be redundant with an existing previously-designed bait from the same target gene.

A subtle consequence of this procedure is that the ordering of transcripts can, to some extent, impact the exact set of baits designed for a given gene. We utilize the rankings provided by the APPRIS database (Annotating principal splice isoforms) to order transcripts such that higher confidence transcript annotations are tiled first. Other measures are taken to make sure ordering is deterministic when ties occur in categorizations from the APPRIS.

### Introns

Introns are not included in our tiling of annotated transcripts, only UTRs and coding regions. As a consequence, intronic reads from targeted genes will be mostly depleted in Targeted Spatial Gene Expression libraries.

### Optimizing Specificity

There are many repetitive elements in the transcriptome that, if overlapped by baits, could result in reduced enrichment in Targeted Spatial Gene Expression libraries, particularly within large target gene panels. Before designing baits, K-mers with many near-identical occurrences in the genome are identified so that they can be avoided during bait design.

If a bait intended to be designed at a given position contains one of the K-mers above, a combination of shifting the bait position up or downstream and removing portions of the bait sequence is used to attempt to retain a functional bait that will not overlap the problematic sequence. If this is not possible, the algorithm will not design a bait at this location.

Approximately 2-3% of baits within the pre-designed panels are shorter than 120bp because of the procedure above.

### Bait Designs for Custom Sequences

Via the 10x Genomics Custom Panel Designer users can also design baits for entirely custom exogenous sequences. These sequences follow the same design procedure as endogenous genes, including the procedure described in the Optimizing Specificity section. Each custom sequence is treated independently. No check for redundancy is performed between submitted sequences, as would normally occur for transcripts within the same gene. Similarly, no check for further homology between submitted sequences and endogenous sequences is performed.

### Identifying Genes with Low Mappability or Low Bait Coverage

In order to assign a UMI to a particular gene, Space Ranger has to be able to confidently identify which gene a given read came from. A small number of genes share a substantial proportion of their sequence content with other genes in the genome, which in extreme cases makes it difficult to assign the corresponding UMI count to a single gene. These genes may have reduced counts in the resulting expression matrices.

The bait design algorithm intentionally avoids designing baits that overlap repetitive sequences as described in the Optimizing Specificity section above. A very small number of genes have low coverage of baits around annotated transcript ends due to the presence of repetitive sequences. These genes are flagged with poor coverage within the last 360bp of the 3’ or 5’ ends of annotated transcripts, where the majority of signal for most genes is expected to be localized in 3'/5' gene expression libraries. While the baits that are provided for these genes may work well, enrichment of these genes may be less robust depending on the particular sample and location of library molecules within annotated transcripts.

Genes with very low mappability or low bait coverage at annotated transcript ends are noted using the mappability_flag column in the gene metadata file provided for predesigned and customized panels. These genes will also trigger a warning if added to customized panels via the 10x Genomics Custom Panel Designer.

### Reviewing Genes with Mappability Warnings

If a gene you want to add to your panel has triggered a mappability warning (or equivalently has the value TRUE in the mappability_flag column of the gene metadata file), the following steps are recommended:

1. Check if this gene has near-zero counts in existing whole transcriptome datasets. If so, this warning may indicate the gene is very poorly mappable and Space Ranger is not able to assign many reads to this gene. Targeting this gene may not yield the desired results.

2. If you have existing whole transcriptome data, you can use IGV or another genome browser to view coverage within this gene alongside the bait locations (as defined in the panel's BED12 file). If the baits appear to cover the regions where reads have aligned, then enrichment is likely to work well. If baits do not overlap the main regions of coverage, enrichment may not work well for this gene and you may want to consider whether to include this gene on your panel. See our documentation on BED12 files for more information.

## Modeling Targeted Spatial Gene Enrichments (count)

In a Targeted Spatial Gene Expression sample, it is assumed that there are two classes of genes: genes from the target panel that are enriched by targeting, and non-targeted genes that constitute the background. Since the number of reads per gene in the sample before targeting is not known, standard approaches to directly calculate read enrichments are not available. Instead, the mean number of reads per UMI (or Mean Reads per UMI) for each gene is calculated, which serves as a proxy for read enrichment. The mean number of reads per UMI for a gene is closely related to the sequencing saturation for that gene (mean reads per UMI = 1/(1 - sequencing saturation), also see Metrics). Given there are a finite number of UMIs per gene in a given sample, enrichment will tend to recover more UMIs from those genes as well as more PCR duplicates arising from those UMIs. Therefore enriched genes are likely to have greater sequencing saturation and greater mean reads per UMI.

In order to assess whether targeted genes were enriched in a given sample, the mean reads per UMI values are first calculated for all genes (both targeted and non-targeted genes). Only UMIs in tissue-associated barcodes are used and only genes with at least 10 UMIs in tissue-associated barcodes are considered for this analysis. Space Ranger then fits a two-component Gaussian Mixture Model to the mean reads per UMI distribution, grouping genes into one of two classes: enriched genes (those with higher mean reads per UMI) and non-enriched genes (those with lower mean reads per UMI). The numbers of targeted and non-targeted genes that are considered enriched by this method are shown in the Targeted Spatial Gene Expression Run Summary Targeted Enrichment dashboard.

Space Ranger may sometimes be unable to confidently assign genes as enriched or non-enriched using this method, particularly when there are too few targeted and/or non-targeted genes meeting the criteria above, or if sequencing saturation is too low. In these rare cases, enrichment calculations may be disabled, and the metrics reporting the number of genes enriched will be N/A.

## Modeling Targeted Spatial Gene Enrichments (Targeted-Compare)

The computation of gene enrichments in targeted-compare is analogous to that described above, with one modification. When running targeted-compare, the parent sample read counts are known, allowing for direct calculation of read enrichments. Space Ranger therefore uses the same Gaussian Mixture Model described above, substituting Mean Reads per UMI with Read Enrichments. Read Enrichments per Gene are calculated as (Number of reads in the Targeted Sample)/(Number of reads in the Parent Sample), after rescaling both samples to the same number of read pairs from Gene Expression libraries.