HOME  ›   pipelines

# Gene Expression Algorithms Overview

## Alignment

A full length cDNA construct is flanked by the 30 bp template switch oligo (TSO) sequence, AAGCAGTGGTATCAACGCAGAGTACATGGG, on the 5' end and poly-A on the 3' end. Some fraction of sequencing reads are expected to contain either or both of these sequences, depending on the fragment size distribution of the sequencing library. Reads derived from short RNA molecules are more likely to contain either or both TSO and poly-A sequence than longer RNA molecules.

Since the presence of non-template sequence in the form of either template switch oligo (TSO) or poly-A, low-complexity ends confound read mapping, TSO sequence is trimmed from the 5' end of read 2 and poly-A is trimmed from the 3' end prior to alignment. Trimming improves the sensitivity of the assay as well as the computational efficiency of the software pipeline.

Tags ts:i and pa:i in the output BAM files indicate the number of TSO nucleotides trimmed from the 5' end of read 2 and the number of poly-A nucleotides trimmed from the 3' end. The trimmed bases are present in the sequence of the BAM record, and the CIGAR string shows the position of these soft-clipped sequences.

### Transcript alignment for Fresh Frozen tissues

For fresh-frozen tissues Space Ranger uses STAR to perform splicing-aware alignment of transcript reads to the genome. This is the default algorithm unless the --probe-set option is invoked for FFPE tissues, see below. After alignment, Space Ranger uses the transcript annotation GTF file to count each read as either exonic, intronic, or intergenic. Space Ranger counts a read as exonic if at least 50% of it intersects an exon, and as intronic if > 50% of its bases map to a gene but not to one of that gene’s exons. Otherwise, the read is counted as intergenic.

For reads that align to a single exonic locus, but also align to one or more non-exonic loci, the exonic locus is prioritized and the read is considered to be confidently mapped to the exonic locus with mapping quality (MAPQ) of 255. All uniquely mapping reads have a MAPQ of 255. For multi-mapping reads, the MAPQ score is defined as $$MAPQ = int(-10*log_{10}(1-\frac{1}{N_{map}}))$$

where ${N_{map}}$ is the number of loci a read can map to. BAM tag NH:i gives the value of ${N_{map}}$ which is 1 for uniquely mapped reads and > 1 for multi mapped reads.

Space Ranger further aligns exonic reads to annotated transcripts, looking for compatibility. A read that has bases 100% compatible with the exons of an annotated transcript, and aligned to the same strand, is considered mapped to the transcriptome. Space Ranger ignores antisense reads that are defined as any read with alignments to an entire gene on the opposite strand and no sense alignments. If the read is compatible with a single gene annotation, it is considered uniquely (confidently) mapped to the transcriptome. These confidently mapped reads are the only ones considered for UMI counting.

### Probe alignment for FFPE tissues

Space Ranger 1.3 introduces support for formalin fixed paraffin embedded (FFPE) tissues. During the FFPE workflow, whole transcriptome probe panels, consisting of a pair of probes for each targeted gene, are added to the tissue. These probe pairs hybridize to their target transcript and are then ligated together.

To analyze FFPE data it is necessary to use the --probe-set option to specify a probe set reference CSV file. When this option is invoked, Space Ranger will count ligation events using the probe aligner algorithm. Reads are also aligned to the reference transcriptome using STAR, but only to determine their alignment positions and CIGAR strings; STAR alignments are not used to assign reads to genes for FFPE data. Sequencing reads are aligned to the probe set reference and assigned to the genes they target. The probe alignment algorithm is similar to a seed-and-extend aligner, where each half of the read is a seed, and is described in detail below.

1. Build an index of the half-probe sequences in the probe set reference CSV.
2. For each read, look up each read half in this index, allowing up to one mismatch and no indels.
3. If both halves of the read map to the same probe ID, the read is confidently mapped with a MAPQ (mapping quality) of 255. Only reads with a mapping quality value of 255 contribute to UMI counts.
4. If one half maps and the other half does not map, compare the sequence of the unmapped half to the expected sequence for that probe, allowing mismatches but not indels. If it matches exceeding the minimum alignment score (30, scoring +1 for a match and -1 for a mismatch), the read is confidently mapped.
5. If the unmapped half of the read does not match the expected sequence for that probe, the read is half-mapped with a MAPQ of 1, and does not contribute to UMI counts.
6. If both halves of the probe map but to different probes, the read is ambiguously mapped with a MAPQ of 3, and does not contribute to UMI counts.
7. When neither half of the probe maps, the read is unmapped with a MAPQ of 0.
8. All reads with up to three mismatches are guaranteed to align.

The BAM tag pr:Z reports a semicolon-separated list of probe IDs. See BAM Alignment Tags for a detailed description.

### Probe filtering for FFPE tissues

The probe reference is filtered to remove genes/features where one or more of the probes targeting this feature might hybridize and ligate at non-targeted loci. Probes that are predicted to have off-target activity to homologous genes or sequences are excluded from analysis by default. These probes are marked with FALSE in the included column of the probe set reference CSV. Any gene that has at least one probe with predicted off-target activity will be excluded from filtered outputs. Setting the --filter-probes=true command line argument of spaceranger count will result in UMI counts from all non-deprecated probes, including those with predicted off-target activity, to be used in the analysis. Probes whose ID is prefixed with DEPRECATED are always excluded from the analysis.

### 10x Barcode correction

To determine whether barcode sequence is correct, Space Ranger compares 10x Barcodes to the known barcodes for a given assay chemistry, which are stored in a barcode whitelist file. For example, there are 4992 barcodes in the whitelist for Visium slides with 6.5 mm Capture Area and 14336 barcodes for Visium slides with 11 mm Capture Area.

Space Ranger uses the following algorithm to correct putative barcode sequences against the whitelist:

1. Count the observed frequency of every barcode on the whitelist in the dataset.
2. For every observed barcode in the dataset that is not on the whitelist and is at most one Hamming distance away from the whitelist sequences:
• Compute the posterior probability that the observed barcode did originate from the whitelist barcode but has a sequencing error at the differing base (by base quality score).
• The corrected barcodes are used for all downstream analysis and output files.

In the output BAM file, the original uncorrected barcode is encoded in the CR tag, and the corrected barcode sequence is encoded in the CB tag. Reads that cannot be assigned a corrected barcode will not have a CB tag.

### UMI counting

Before counting UMIs, Space Ranger attempts to correct for sequencing errors in the UMI sequences. Reads that were confidently mapped to the transcriptome are placed into groups that share the same barcode, UMI, and gene annotation. If two groups of reads have the same barcode and gene, but their UMIs differ by a single base (i.e., are Hamming distance 1 apart), then one of the UMIs was likely introduced by a substitution error in sequencing. In this case, the UMI of the less-supported read group is corrected to the UMI with higher support.

Space Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.

After these two filtering steps, each observed barcode, UMI, gene/feature combination is recorded as a UMI count in the unfiltered feature-barcode matrix. The number of reads supporting each counted UMI is also recorded in the molecule info file.

## Detecting tissue barcodes

Space Ranger detects spots under the tissue section in the Imaging subpipeline. Only the barcodes associated to these under tissue spots are captured in the Filtered feature-barcode matrix and are used for downstream analyses.

## Secondary analysis of gene expression

Space Ranger uses the feature barcode matrix and the spatial location data from the image to generate secondary analysis results including clusters, t-SNE and UMAP projections as well as differential gene expression between the clusters. All of the secondary data are recorded in the cloupe.cloupe file which can be visualized in Loupe Browser.

### Dimensionality reduction

#### PCA

In order to reduce the gene expression matrix to its most important features, Space Ranger uses Principal Components Analysis (PCA) to change the dimensionality of the dataset from (spots x genes) to (spots x M) where M is 10. The pipeline uses a python implementation of IRLBA algorithm, (Baglama & Reichel, 2005), which is modified to reduce memory consumption.

#### t-SNE

For visualizing data in 2-d space, Space Ranger passes the PCA-reduced data into t-Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction method (Van der Maaten, 2014). The C++ reference implementation by Van der Maaten (2014) was modified to take a PRNG seed for determinism. The runtime is also decreased by fixing the number of output dimensions at compile time to two or three. The t-SNE is visualized in both web_summary.html and in Loupe Browser.

#### UMAP

Space Ranger also supports Uniform Manifold Approximation and Projection (UMAP), which estimates a topology of the high dimensional data and uses this information to estimate a low dimensional embedding that preserves relationships between datapoints (McInnes & Healy, 2018). The pipeline uses the python implementation of this algorithm by McInnes & Healy (2018). UMAP coordinates are available in the pipeline output, but not displayed in the web_summary.html.

### Clustering

Space Ranger uses two different methods for clustering spots by expression similarity, both of which operate in the PCA representation.

#### Graph-based

The graph-based clustering algorithm consists of building a sparse nearest-neighbor graph (where spots are linked if they are among the k nearest Euclidean neighbors of one another), followed by Louvain Modularity Optimization (LMO; Blondel, Guillaume, Lambiotte, & Lefebvre, 2008), an algorithm which seeks to find highly-connected "modules" in the graph. The value of k, the number of nearest neighbors, is set to scale logarithmically with the number of spots. An additional cluster-merging step is done: Perform hierarchical clustering on the cluster-medoids in PCA space and merge pairs of sibling clusters if there are no genes differentially expressed between them (with B-H adjusted p-value below 0.05). The hierarchical clustering and merging is repeated until there are no more cluster-pairs to merge.

The use of LMO to cluster spots was inspired by a similar method in the R package Seurat.

#### K-means clustering

Space Ranger also performs traditional K-means clustering across a range of K values, where K is the preset number of clusters.

### Differential expression

In order to identify genes whose expression is specific to each cluster, Space Ranger tests, for each gene and each cluster, whether the in-cluster mean differs from the out-of-cluster mean.

In order to find differentially expressed genes between groups of spots, Space Ranger uses the quick and simple method sSeq (Yu, Huber, & Vitek, 2013), which employs a negative binomial exact test. When the counts become large, Space Ranger switches to the fast asymptotic beta test used in edgeR (Robinson & Smyth, 2007). For each cluster, the algorithm is run on that cluster versus all other spots, yielding a list of genes that are differentially expressed in that cluster relative to the rest of the sample.

Space Ranger's implementation differs slightly from that in the paper. In the sSeq paper, the authors recommend using DESeq's geometric mean-based definition of library size (Love, Huber & Anders, 2014). Space Ranger instead computes relative library size as the total UMI counts for each spot divided by the median UMI counts per spot. As with sSeq, normalization is implicit in that the per-spot library-size parameter is incorporated as a factor in the exact-test probability calculations.

### Spatial enrichment

Space Ranger quantifies spatial enrichment, measured by Moran's I, as a discovery tool that can be useful to identify features that have distinct patterns of expression (spatial autocorrelation). Moran’s I is not related to differential expression as it is independent from any clustering information. Features with similar Moran's I do not necessarily have similar spatial expression patterns although methods do exist for identifying groups of genes with similar spatial enrichment patterns. The Moran's I metric scale ranges from -1 (perfectly dispersed) to 1 (perfectly enriched).

## References

Baglama, J. & Reichel, L., Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM Journal on Scientific Computing 27, 19–42 (2005).

Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, (2008).

Love, M. L., Huber, W. & Anders, S., Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, number 550 (2014).

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv (2018).

Robinson, M. D. & Smyth, G. K. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332 (2007).

Van der Maaten, L., Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15, 3221-3245 (2014).

Yu, D., Huber, W. & Vitek, O., Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29, 1275–1282 (2013).