
# Peak Annotations

## How a peak is annotated to genes

Peaks are mapped to genes based on the genomic locations of nearby genes. The general principles are as follows:

• The goal of peak annotation is to map peaks to gene symbols, where a gene symbol represents the union of all transcripts of a given gene.
• A peak can be mapped to multiple genes.
• A peak can have only one peak type for a given gene. For example, a peak cannot be annotated as both a promoter peak and a distal peak of the same gene.
• Only protein-coding genes are included for annotation.

The annotation procedure is as follows:

1. If a peak overlaps the promoter region (-1000 bp to +100 bp) of any TSS, it is annotated as a promoter peak of that gene.
2. If a peak is within 200 kb of the closest TSS and is not a promoter peak of the gene with that TSS, it is annotated as a distal peak of that gene.
3. If a peak overlaps the body of a transcript and is neither a promoter peak nor a distal peak of that gene, it is annotated as a distal peak of that gene with the distance set to zero.
4. If a peak has not been mapped to any gene in the previous steps, it is annotated as an intergenic peak with no gene symbol assigned.
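
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's implementation. It assumes inclusive coordinates, transcripts on the same contig as the peak, strand-aware distances measured from the TSS to the nearest edge of the peak, and that the 200 kb limit is inclusive. It also assumes a promoter peak reports the same signed distance as any other peak. The `Transcript` class and all function names are hypothetical.

```python
from dataclasses import dataclass

PROMOTER_UPSTREAM = 1000   # bp upstream of the TSS
PROMOTER_DOWNSTREAM = 100  # bp downstream of the TSS
MAX_DISTAL = 200_000       # 200 kb limit for distal peaks (assumed inclusive)


@dataclass(frozen=True)
class Transcript:  # hypothetical container, not a pipeline type
    gene: str
    start: int   # leftmost coordinate of the transcript body (inclusive)
    end: int     # rightmost coordinate of the transcript body (inclusive)
    strand: str  # "+" or "-"

    @property
    def tss(self):
        return self.start if self.strand == "+" else self.end


def signed_distance(peak_start, peak_end, tx):
    """Strand-aware distance from the TSS to the nearest peak edge (0 if the peak covers the TSS)."""
    if peak_start <= tx.tss <= peak_end:
        return 0
    if tx.strand == "+":
        return peak_start - tx.tss if peak_start > tx.tss else peak_end - tx.tss
    return tx.tss - peak_end if peak_end < tx.tss else tx.tss - peak_start


def overlaps_promoter(peak_start, peak_end, tx):
    if tx.strand == "+":
        lo, hi = tx.tss - PROMOTER_UPSTREAM, tx.tss + PROMOTER_DOWNSTREAM
    else:
        lo, hi = tx.tss - PROMOTER_DOWNSTREAM, tx.tss + PROMOTER_UPSTREAM
    return peak_start <= hi and peak_end >= lo


def annotate_peak(peak_start, peak_end, transcripts):
    """Return a list of (gene, distance, peak_type) tuples for one peak."""
    annotations = {}  # gene -> (distance, peak_type); one type per gene

    # Step 1: promoter peaks.
    for tx in transcripts:
        if tx.gene not in annotations and overlaps_promoter(peak_start, peak_end, tx):
            annotations[tx.gene] = (signed_distance(peak_start, peak_end, tx), "promoter")

    # Step 2: distal peak of the gene with the closest TSS, if within 200 kb.
    if transcripts:
        closest = min(transcripts, key=lambda tx: abs(signed_distance(peak_start, peak_end, tx)))
        distance = signed_distance(peak_start, peak_end, closest)
        if abs(distance) <= MAX_DISTAL and closest.gene not in annotations:
            annotations[closest.gene] = (distance, "distal")

    # Step 3: peaks inside a transcript body become distal peaks at distance zero.
    for tx in transcripts:
        if tx.gene not in annotations and peak_start <= tx.end and peak_end >= tx.start:
            annotations[tx.gene] = (0, "distal")

    # Step 4: anything still unmapped is intergenic.
    if not annotations:
        return [(None, None, "intergenic")]
    return [(gene, dist, ptype) for gene, (dist, ptype) in annotations.items()]


# Made-up transcripts for illustration only.
transcripts = [
    Transcript("GENEA", 10_000, 20_000, "+"),
    Transcript("GENED", 90_000, 200_000, "+"),
    Transcript("GENEC", 150_000, 151_000, "+"),
    Transcript("GENEB", 300_000, 350_000, "-"),
]
for peak in [(9_500, 9_800), (152_000, 152_300), (900_000, 900_500)]:
    print(peak, annotate_peak(*peak, transcripts))
```

Keeping the annotations in a dictionary keyed by gene enforces the rule that a peak has only one type per gene: later steps only add genes that earlier steps did not claim.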

## Annotation output file

The output file of peak annotation is peak_annotation.tsv. It has the following format:

| Column Number | Name | Description |
|---|---|---|
| 1 | peak | Location of the peak, denoted as "contig_start_end". |
| 2 | gene | Gene symbol based on the gene annotation in the reference. |
| 3 | distance | Distance of the peak from the TSS of the gene. A positive distance means the start of the peak is downstream of the TSS, whereas a negative distance means the end of the peak is upstream of the TSS. A zero distance means the peak overlaps the TSS or the transcript body of the gene. |
| 4 | peak_type | Can be "promoter", "distal" or "intergenic". |

Below is an example of a subsection of a peak_annotation.tsv file. Note that for a given peak, the entries for gene, distance and peak_type are ";"-separated and listed in the same order. For example, for peak chr1_145253531_145253882, the entries are NOTCH2NL;RP11-458D21.5, 28933;0 and distal;distal. This is parsed as follows: the peak is distal to gene NOTCH2NL at a distance of 28,933 bp, and it is distal to gene RP11-458D21.5 because it lies within that gene's body (distance zero).

    peak                        gene                    distance    peak_type
    chr1_144529748_144530140    PPIAL4B                 -165503     distal
    chr1_144533459_144534494    PPIAL4B                 -169214     distal
    chr1_144535921_144536510    PPIAL4B                 -171676     distal
    chr1_144593494_144593902                                        intergenic
    chr1_144594466_144594608                                        intergenic
    chr1_144907978_144908683    PDE4DIP                 23465       distal
    chr1_144917975_144918388    PDE4DIP                 13760       distal
    chr1_144930729_144932952    PDE4DIP                 0           promoter
    chr1_144935233_144935903    PDE4DIP                 -2682       distal
    chr1_145021465_145021812    PDE4DIP                 17959       distal
    chr1_145029934_145030407    PDE4DIP                 9364        distal
    chr1_145039179_145040353    PDE4DIP                 0           promoter
    chr1_145042909_145043074    PDE4DIP                 -2908       distal
    chr1_145058730_145059385    PDE4DIP                 16497       distal
    chr1_145075570_145075775    PDE4DIP                 107         distal
    chr1_145090390_145090497    SEC22B                  -5916       distal
    chr1_145096097_145096897    SEC22B                  0           promoter
    chr1_145114341_145114882    SEC22B                  17929       distal
    chr1_145129664_145130160    CH17-478G19.1           9453        distal
    chr1_145138857_145139783    CH17-478G19.1           0           promoter
    chr1_145208754_145210260    NOTCH2NL;RP11-458D21.5  0;0         promoter;promoter
    chr1_145253531_145253882    NOTCH2NL;RP11-458D21.5  28933;0     distal;distal
    chr1_145293232_145293711    NBPF10;RP11-458D21.5    0;0         promoter;distal
    chr1_145382220_145383308    HFE2                    -14942      distal
    chr1_145395396_145399610    HFE2                    0           promoter
    chr1_145421600_145421979    HFE2                    8322        distal
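
To illustrate how the ";"-separated entries line up, here is a minimal Python sketch that expands each row into one record per peak-gene pair. It assumes the file is tab-separated with the header shown above, and that intergenic rows leave the gene and distance fields empty, as the example suggests. The function name `read_peak_annotation` and the record keys are hypothetical.

```python
import csv
import io


def read_peak_annotation(handle):
    """Yield one dict per peak-gene pair from an open peak_annotation.tsv handle."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # rsplit keeps contig names that themselves contain underscores intact.
        contig, start, end = row["peak"].rsplit("_", 2)
        base = {"contig": contig, "start": int(start), "end": int(end)}
        if not row["gene"]:
            # Intergenic peak: no gene symbol and no distance (assumed empty fields).
            yield {**base, "gene": None, "distance": None, "peak_type": row["peak_type"]}
            continue
        # The three columns are sorted the same way, so zip pairs them up.
        genes = row["gene"].split(";")
        distances = row["distance"].split(";")
        peak_types = row["peak_type"].split(";")
        for gene, distance, peak_type in zip(genes, distances, peak_types):
            yield {**base, "gene": gene, "distance": int(distance), "peak_type": peak_type}


# Three rows copied from the example above, with tabs as separators.
SAMPLE = (
    "peak\tgene\tdistance\tpeak_type\n"
    "chr1_144593494_144593902\t\t\tintergenic\n"
    "chr1_145253531_145253882\tNOTCH2NL;RP11-458D21.5\t28933;0\tdistal;distal\n"
    "chr1_145293232_145293711\tNBPF10;RP11-458D21.5\t0;0\tpromoter;distal\n"
)
records = list(read_peak_annotation(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

For a real file, the same function can be called on `open(path, newline="")` in place of the `io.StringIO` wrapper.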


## Processing the peak annotation file in R

The peak annotation file can be used for custom analyses, such as plotting peak-gene relationships or generating gene activity scores from peaks. Here we provide some examples of loading the peak annotation file and converting it to various data structures.

    library(tidyverse)
    library(Matrix)

    peak_annotation_file <- "/opt/sample345/outs/peak_annotation.tsv"

    # Direct loading of the original format. All four columns are read as character,
    # because distance holds ";"-separated values for peaks mapped to several genes.
    df_peakanno <- readr::read_tsv(peak_annotation_file, col_types = "cccc")

    # Separate each row into a single peak-gene-type combination, i.e. split by ";".
    # convert = TRUE turns the split distance values back into integers.
    df_peakanno <- readr::read_tsv(peak_annotation_file, col_types = "cccc") %>%
      tidyr::separate_rows(gene, distance, peak_type, sep = ";", convert = TRUE)

    # Convert to a sparse binary matrix of peak-gene mapping relationships.
    # The order of the peaks is the same as in the peak-barcode matrix in the pipeline output.
    # The order of the genes is alphanumeric.
    sparseMatrix_peakanno <- readr::read_tsv(peak_annotation_file, col_types = "cccc") %>%
      # fix the factor levels before splitting so that the file order of peaks is kept
      dplyr::mutate(peak = factor(peak, levels = peak)) %>%
      tidyr::separate_rows(gene, distance, peak_type, sep = ";") %>%
      # drop intergenic peaks, which have no gene symbol
      dplyr::filter(!is.na(gene)) %>%
      # can also add an extra filter here using dplyr::filter(),
      # such as restricting peaks to promoters only or to a certain distance from the TSS
      dplyr::mutate(gene = factor(gene)) %>%
      dplyr::group_by(peak, gene) %>%
      dplyr::summarise(value = as.integer(n() > 0), .groups = "drop") %>%
      stats::xtabs(value ~ peak + gene, data = ., sparse = TRUE)
