
# Peak Annotations

## How a peak is annotated

Peaks are mapped to one or more genes based on genomic proximity. The general principle is as follows:

• A gene is assigned a location based on its transcription start site (TSS), which is determined as the span of the transcription start sites of the gene's "basic" transcripts. These are transcripts that carry the GTF attribute tag "basic". If a gene has no such transcripts, then all of its transcripts are considered when computing the TSS.
• All genes in the GTF file are used for gene annotation independent of gene type.
• A peak is annotated as "promoter" or "distal" relative to a gene, or as "intergenic".
• A peak can be annotated as related to multiple genes. However, a peak cannot be annotated as both "promoter" and "distal" of the same gene.
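
As a rough illustration of how a TSS span could be derived from transcripts, the sketch below works on a small, made-up transcript table. The column names (`gene`, `strand`, `start`, `end`, `basic`) and the helper `tss_span()` are hypothetical and are not part of the pipeline. It also assumes that the TSS of a transcript is its start coordinate on the + strand and its end coordinate on the - strand.

```r
# Toy transcript table; all values are invented for illustration.
transcripts <- data.frame(
  gene   = c("G1", "G1", "G1", "G2", "G2"),
  strand = c("+", "+", "+", "-", "-"),
  start  = c(1000, 1200, 900, 5000, 5100),
  end    = c(3000, 3000, 3000, 8000, 8200),
  basic  = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Assumption: the TSS is the 5' end of the transcript, which depends on strand.
transcripts$tss <- ifelse(transcripts$strand == "+",
                          transcripts$start, transcripts$end)

# Hypothetical helper: span of the TSSs of "basic" transcripts,
# falling back to all transcripts when a gene has no "basic" transcript.
tss_span <- function(tx) {
  use <- if (any(tx$basic)) tx[tx$basic, ] else tx
  data.frame(gene = tx$gene[1],
             tss_start = min(use$tss),
             tss_end = max(use$tss),
             stringsAsFactors = FALSE)
}

gene_tss <- do.call(rbind, lapply(split(transcripts, transcripts$gene), tss_span))
print(gene_tss)
```

In this toy input, G1 uses only its two "basic" transcripts, while G2 has none and therefore falls back to all of its transcripts.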

The annotation procedure is as follows:

1. If a peak overlaps the promoter region (-1000 bp, +100 bp) of any TSS, it is annotated as a promoter peak of the corresponding gene.
2. If a peak is within 200 kb of the closest TSS, and it is not a promoter peak of the gene of that closest TSS, it is annotated as a distal peak of that gene.
3. If a peak overlaps the body of a transcript, and it is neither a promoter peak nor a distal peak of the gene, it is annotated as a distal peak of that gene with the distance set to zero.
4. If a peak has not been mapped to any gene in the previous steps, it is annotated as an intergenic peak with no gene symbol assigned.
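
The four steps above can be sketched for a single peak as follows. This is a simplified illustration, not the pipeline's implementation. The helper `annotate_peak()` and the gene table columns are hypothetical, each gene is reduced to a single TSS position and a single transcript body, and the promoter window is assumed to be oriented by strand.

```r
# Toy gene table (invented values): one TSS and one transcript body per gene.
genes <- data.frame(
  gene       = c("A", "B", "C"),
  strand     = c("+", "-", "+"),
  tss        = c(10000, 60000, 500000),
  body_start = c(10000, 20000, 500000),
  body_end   = c(15000, 60000, 900000),
  stringsAsFactors = FALSE
)

# Hypothetical helper; thresholds follow the procedure described in the text.
annotate_peak <- function(peak_start, peak_end, genes,
                          upstream = 1000, downstream = 100, max_dist = 200000) {
  # Promoter window around each TSS (assumed to be strand-aware).
  prom_start <- ifelse(genes$strand == "+", genes$tss - upstream, genes$tss - downstream)
  prom_end   <- ifelse(genes$strand == "+", genes$tss + downstream, genes$tss + upstream)

  # Step 1: the peak overlaps a promoter window.
  is_promoter <- peak_start <= prom_end & peak_end >= prom_start

  # Unsigned gap between the peak and each TSS (0 if the peak covers the TSS).
  gap <- pmax(genes$tss - peak_end, peak_start - genes$tss, 0)

  # Step 2: closest TSS within max_dist, unless already a promoter peak of that gene.
  is_distal <- rep(FALSE, nrow(genes))
  closest <- which.min(gap)
  if (gap[closest] <= max_dist && !is_promoter[closest]) is_distal[closest] <- TRUE

  # Step 3: the peak overlaps a transcript body and is not a promoter peak of that gene.
  in_body <- peak_start <= genes$body_end & peak_end >= genes$body_start
  is_distal <- is_distal | (in_body & !is_promoter)

  # Step 4: no gene was assigned.
  hit <- is_promoter | is_distal
  if (!any(hit)) {
    return(data.frame(gene = NA_character_, peak_type = "intergenic",
                      stringsAsFactors = FALSE))
  }
  data.frame(gene = genes$gene[hit],
             peak_type = ifelse(is_promoter[hit], "promoter", "distal"),
             stringsAsFactors = FALSE)
}

print(annotate_peak(9500, 9800, genes))      # overlaps the promoter window of A
print(annotate_peak(30000, 30500, genes))    # near the TSS of A, inside the body of B
print(annotate_peak(750000, 750200, genes))  # far from any TSS, inside the body of C
print(annotate_peak(280000, 280200, genes))  # more than 200 kb from any TSS, in no body
```

With this toy table, the four calls produce a promoter peak of A, a peak that is distal to both A and B, a distal peak of C through gene-body overlap only, and an intergenic peak.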

## Annotation output file

The output file of peak annotation is atac_peak_annotation.tsv. It has the following format:

| Column Number | Name | Description |
|---|---|---|
| 1 | peak | Location of peak, denoted as "contig_start_end". |
| 2 | gene | Gene symbol based on the gene annotation in the reference. |
| 3 | distance | Distance of peak from TSS of gene. Positive distance means the start of the peak is downstream of the position of the TSS, whereas negative distance means the end of the peak is upstream of the TSS. Zero distance means the peak overlaps with the TSS or the peak overlaps with the transcript body of the gene. |
| 4 | peak_type | Can be "promoter", "distal" or "intergenic". |

Below is an example of a subsection of an atac_peak_annotation.tsv file. Note that for a given peak, the entries for gene, distance and peak_type are ";"-separated and listed in the same order. For example, for peak chr1_145253531_145253882, the entries are NOTCH2NL;RP11-458D21.5 for gene, 28933;0 for distance and distal;distal for peak_type. This means that the peak is distal to gene NOTCH2NL at a distance of 28,933 bp, and distal to gene RP11-458D21.5 because it lies within the gene body.
```
peak	gene	distance	peak_type
chr1_144529748_144530140	PPIAL4B	-165503	distal
chr1_144533459_144534494	PPIAL4B	-169214	distal
chr1_144535921_144536510	PPIAL4B	-171676	distal
chr1_144593494_144593902			intergenic
chr1_144594466_144594608			intergenic
chr1_144907978_144908683	PDE4DIP	23465	distal
chr1_144917975_144918388	PDE4DIP	13760	distal
chr1_144930729_144932952	PDE4DIP	0	promoter
chr1_144935233_144935903	PDE4DIP	-2682	distal
chr1_145021465_145021812	PDE4DIP	17959	distal
chr1_145029934_145030407	PDE4DIP	9364	distal
chr1_145039179_145040353	PDE4DIP	0	promoter
chr1_145042909_145043074	PDE4DIP	-2908	distal
chr1_145058730_145059385	PDE4DIP	16497	distal
chr1_145075570_145075775	PDE4DIP	107	distal
chr1_145090390_145090497	SEC22B	-5916	distal
chr1_145096097_145096897	SEC22B	0	promoter
chr1_145114341_145114882	SEC22B	17929	distal
chr1_145129664_145130160	CH17-478G19.1	9453	distal
chr1_145138857_145139783	CH17-478G19.1	0	promoter
chr1_145208754_145210260	NOTCH2NL;RP11-458D21.5	0;0	promoter;promoter
chr1_145253531_145253882	NOTCH2NL;RP11-458D21.5	28933;0	distal;distal
chr1_145293232_145293711	NBPF10;RP11-458D21.5	0;0	promoter;distal
chr1_145382220_145383308	HFE2	-14942	distal
chr1_145395396_145399610	HFE2	0	promoter
chr1_145421600_145421979	HFE2	8322	distal
```
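
To make the ";"-separated layout concrete, the sketch below takes one row from the example and splits it into one record per gene using base R only. The helper `split_field()` is hypothetical. Splitting the peak name on its last two underscores is an assumption, made so that contig names that themselves contain underscores are still handled.

```r
# One row from the example, written as tab-separated text.
row <- "chr1_145253531_145253882\tNOTCH2NL;RP11-458D21.5\t28933;0\tdistal;distal"

fields <- strsplit(row, "\t", fixed = TRUE)[[1]]
names(fields) <- c("peak", "gene", "distance", "peak_type")

# Hypothetical helper: split one ";"-separated field into its entries.
split_field <- function(x) strsplit(x, ";", fixed = TRUE)[[1]]

# Entry i of gene, distance and peak_type all refer to the same gene.
records <- data.frame(
  peak      = fields[["peak"]],
  gene      = split_field(fields[["gene"]]),
  distance  = as.integer(split_field(fields[["distance"]])),
  peak_type = split_field(fields[["peak_type"]]),
  stringsAsFactors = FALSE
)
print(records)

# The peak name encodes "contig_start_end"; match the last two numeric parts.
m <- regmatches(fields[["peak"]],
                regexec("^(.*)_([0-9]+)_([0-9]+)$", fields[["peak"]]))[[1]]
contig     <- m[2]
peak_start <- as.integer(m[3])
peak_end   <- as.integer(m[4])
```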


## Processing the peak annotation file in R

The peak annotation file can be used for custom analyses, such as plotting peak-gene relationships or generating gene activity scores from peaks. Here we provide some examples of loading the peak annotation file and converting it to various data structures.

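```r
library(tidyverse)
library(Matrix)

peak_annotation_file <- "/opt/sample345/outs/atac_peak_annotation.tsv"

# Read every column as text first, because distance can hold ";"-separated values,
# then separate each row into a single peak-gene-type combination, i.e. split by ";"
peak_annotation <- readr::read_tsv(peak_annotation_file,
                                   col_types = readr::cols(.default = "c")) %>%
  dplyr::mutate(peak = factor(peak, levels = unique(peak))) %>%  # keep file order
  tidyr::separate_rows(gene, distance, peak_type, sep = ";", convert = TRUE)

# Convert to a sparse binary matrix of peak-gene mapping relationship
# the order of the peaks is the same as the peak-barcode matrix in the pipeline output
# the order of the genes is alphanumeric
peak_gene_matrix <- peak_annotation %>%
  dplyr::filter(!is.na(gene)) %>%  # intergenic peaks have no gene and stay as empty rows
  dplyr::mutate(value = 1) %>%
  stats::xtabs(value ~ peak + gene, data = ., sparse = TRUE)
```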
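
One simple way to turn a peak-gene matrix into a gene activity score is to sum, for each gene, the counts of all peaks mapped to it. The sketch below shows this on toy matrices. It is an assumption about how such a score might be computed, not a method prescribed by the pipeline, and all peak, gene and barcode names are invented.

```r
library(Matrix)

# Toy peak-by-gene binary matrix (3 peaks x 2 genes); values are invented.
peak_gene <- sparseMatrix(
  i = c(1, 2, 2, 3), j = c(1, 1, 2, 2), x = 1,
  dims = c(3, 2),
  dimnames = list(c("chr1_100_200", "chr1_300_400", "chr1_500_600"),
                  c("GENE_A", "GENE_B"))
)

# Toy peak-by-barcode count matrix with the same peak order (3 peaks x 2 barcodes).
peak_barcode <- sparseMatrix(
  i = c(1, 2, 3, 3), j = c(1, 1, 1, 2), x = c(2, 1, 4, 3),
  dims = c(3, 2),
  dimnames = list(rownames(peak_gene), c("BC1", "BC2"))
)

# The two matrices must list the peaks in the same order.
stopifnot(identical(rownames(peak_gene), rownames(peak_barcode)))

# Gene-by-barcode score: sum of counts over all peaks linked to each gene.
gene_activity <- t(peak_gene) %*% peak_barcode
print(gene_activity)
```

Because a peak can map to several genes, its counts contribute to each of them. Weighting peaks by distance or by peak type would be a natural refinement, but it is not shown here.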