---
title: "The ChIPpeakAnno user's guide"
author: "Jianhong Ou, Jun Yu, Lihua Julie Zhu"
output: BiocStyle::html_document
date: "`r doc_date()`"
package: "`r pkg_ver('ChIPpeakAnno')`"
bibliography: bibliography.bib
csl: nature.csl
abstract: >
  The package is to facilitate the downstream analysis for ChIP-seq experiments. 
  It includes functions to find  the nearest gene, exon, miRNA or custom features such as the most
  conserved elements and other transcription factor binding sites
  supplied by users, retrieve the sequences
  around the peak, obtain enriched Gene Ontology (GO) terms or pathways. 
  Starting 2.0.5, new functions have been
  added for finding the peaks with bi-directional promoters with
  summary statistics (peaksNearBDP), for summarizing the
  occurrence of motifs in peaks (summarizePatternInPeaks) and for
  adding other IDs to annotated peaks or enrichedGO (addGeneIDs). Starting 3.4, 
  we also implement functions for permutation test to determine the association 
  between two sets of peaks, and to plot heatmaps for given feature/peak ranges. 
  This package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, 
  multtest and stat packages.
vignette: >
  %\VignetteIndexEntry{ChIPpeakAnno Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
    library(ChIPpeakAnno)
    library(org.Hs.eg.db)
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    library(rtracklayer)
    library(GO.db)
    library(EnsDb.Hsapiens.v75)
    library(BSgenome.Ecoli.NCBI.20080805)
})
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```

# Introduction

Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) and 
ChIP followed by genome tiling array analysis (ChIP-chip) have become prevalent 
high throughput technologies for identifying the binding sites of DNA-binding 
proteins genome-wise. A number of algorithms have been published to facilitate 
the identification of the binding sites of the DNA-binding proteins of interest.
The identified binding sites as the list of peaks are usually converted to BED 
or bigwig files to be loaded to the UCSC genome browser as custom tracks for 
investigators to view the proximity to various genomic features such as genes, 
exons or conserved elements. However, clicking through the genome browser is a 
daunting task when the number of peaks gets large or the peaks spread widely 
across the genome. 

Here we developed **ChIPpeakAnno**, a Bioconductor[@Gentleman2004] package, to facilitate 
the batch annotation of the peaks identified from ChIP-seq or ChIP-chip 
experiments. We implemented functionality to find the nearest gene, exon, miRNA
or other custom features supplied by users such as the most conserved 
elements and other transcription factor binding sites leveraging GRanges. Since 
the genome annotation gets updated frequently, we have leveraged the 
**biomaRt** package to retrieve the annotation data on the fly. The users 
also have the flexibility to pass their own annotation data or annotation from 
**GenomicFeatures** as GRanges. We have also leveraged **BSgenome** 
and **biomaRt** to retrieve the sequences around the identified peak for 
peak validation or motif discovery[@Durinck2005]. To understand whether the identified peaks 
are enriched around genes with certain GO terms, we have implemented the Gene 
Ontology (GO) enrichment test in the **ChIPpeakAnno** package leveraging the 
hypergeometric test phyper in the **stats** package and integrated with the GO 
annotation from the **GO.db** package and multiplicity adjustment functions 
from the **multtest** package[@Benjamini1995; @benjamini2001; @johnson2005; @Holm1979; @Hochberg1988; @dudoit2003]. 
The pathway analysis using reactome or KEGG is also supported. Starting 3.4, 
we also implement the functions for permutation test
to determine the association between two sets of peaks, and to plot heatmaps for
given feature/peak ranges. 


# Quick start
```{r quickStart}
library(ChIPpeakAnno)
## import the MACS output
macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")
macsOutput <- toGRanges(macs, format="MACS")
## annotate the peaks with precompiled ensembl annotation
data(TSS.human.GRCh38)
macs.anno <- annotatePeakInBatch(macsOutput, AnnotationData=TSS.human.GRCh38, 
                                output="overlapping", maxgap=5000L)
## add gene symbols
library(org.Hs.eg.db)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                        orgAnn="org.Hs.eg.db", 
                        IDs2Add="symbol")

if(interactive()){## annotate the peaks with UCSC annotation
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
    macs.anno <- annotatePeakInBatch(macsOutput, 
                                     AnnotationData=ucsc.hg38.knownGene, 
                                     output="overlapping", maxgap=5000L)
    macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                            orgAnn="org.Hs.eg.db", 
                            feature_id_type="entrez_id",
                            IDs2Add="symbol")
}
```


# An example ChIP-seq analysis workflow using ChIPpeakAnno
We illustrate here a common downstream analysis workflow for ChIP-seq 
experiments. The input of **ChIPpeakAnno** is a list of called peaks identified 
from ChIP-seq experiments. The peaks are represented by by GRanges
in **ChIPpeakAnno**. We implemented a conversion functions `toGRanges` 
to convert commonly used peak file formats, such as 
BED, GFF, or other user defined formats such as MACS (a popular peak calling 
program) output file to GRanges. Type ?`toGRanges` for more information.

The workflow here exemplifies converting the BED and GFF files to GRanges,
finding the overlapping peaks between the two peak sets, and visualizing the 
number of common and specific peaks with Venn diagram.

```{r workflow19, fig.cap="venn diagram of overlaps for duplicated experiments"}
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
gr1[1:2]
gr1.import[1:2] #note the name slot is different from gr1
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)
```

A pie chart is used to demonstrate the overlap features of the common peaks.

```{r workflow20,fig.cap="Pie chart of common peaks among features"}
pie1(table(ol$overlappingPeaks[["gr1///gr2"]]$overlapFeature))
```

After finding the overlapping peaks, you can use `annotatePeakInBatch` 
to annotate all the genomic features in the _AnnotationData_ within 5kb from 
those peaks. 


```{r workflow21}
overlaps <- ol$peaklist[["gr1///gr2"]]
## ============== old style ===========
## data(TSS.human.GRCh37) 
## overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
##                                      output="overlapping", maxgap=5000L)
## overlaps.anno <- addGeneIDs(overlaps.anno, "org.Hs.eg.db", "symbol")
## ============== new style ===========
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- annoGR(EnsDb.Hsapiens.v75, feature="gene")
info(annoData)
annoData[1:2]
overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
                                    output="overlapping", maxgap=5000L)
overlaps.anno$gene_name <- 
    annoData$gene_name[match(overlaps.anno$feature,
                             names(annoData))]
head(overlaps.anno)
```

Once the peaks are annotated, the distribution of the distance to the nearest 
feature such as the transcription start sites (TSS) can be plotted. 
The sample code here plots the distribution 
of the aggregated peak scores and the number of peaks around the TSS.

```{r workflow22,fig.cap="Distribution of aggregated peak scores or peak numbers around transcript start sites.",fig.width=8,fig.height=6}
gr1.copy <- gr1
gr1.copy$score <- 1
binOverFeature(gr1, gr1.copy, annotationData=annoData,
               radius=5000, nbins=10, FUN=c(sum, length),
               ylab=c("score", "count"), 
               main=c("Distribution of aggregated peak scores around TSS", 
                      "Distribution of aggregated peak numbers around TSS"))
```

The distribution of the peaks over exon, intron, enhancer, proximal promoter, 
5' UTR and 3' UTR can be summarized in peak centric or nucleotide centric view using 
the function `assignChromosomeRegion`. 
Note: setting nucleotideLevel = TRUE will give a nucleotide level distribution over 
different features.

```{r workflow23,fig.cap="Peak distribution over different genomic features.",fig.width=10,fig.height=4}
if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
    aCR<-assignChromosomeRegion(gr1, nucleotideLevel=FALSE, 
                           precedence=c("Promoters", "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                           TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
    barplot(aCR$percentage)
}
```

# Detailed Use Cases and Scenarios

Here we describe some details in using different functions in `ChIPpeakAnno` 
for different tasks. As shown in the last section, the common workflow includes:
loading called peaks from BED, GFF, or other formats; evaluating and visualizing 
the concordance among the biological replicates; combining peaks from 
replicates; preparing genomic annotation(s) as GRanges; associating/annotating 
peaks with the annotation(s); summarizing peak 
distributions over exon, intron, enhancer, proximal promoter, 5'UTR and 3'UTR 
regions; retrieving the sequences around the peaks; and enrichment analysis of GO and 
biological pathway. We also implement the functions to plot the heatmap of given
peak ranges, and perform permutation test to determine if there is an association
between two sets of peaks.

## Determine the overlapping peaks and visualize the overlaps with Venn diagram

It is important to evaluate the concordance among the peaks of biological 
replicates. Prior to associating features of interest with the peaks, it is a 
common practice to combine the peaks from biological replicates. Also, it is biologically 
interesting to obtain overlapping peaks from different ChIP-seq experiments
to imply the potential formation of transcription factor complexes. `ChIPpeakAnno` implemented 
functions to achieve those goals and quantitatively determine the significance of 
peak overlaps and generate a Venn diagram for visualization. 

Here is the sample code to obtain the overlapping peaks with maximum gap of 1kb for 
two peak ranges.

```{r findOverlapsOfPeaks3}
peaks1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", 
                              "2", "6", "6", "6", "6", "5"),
                   ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869, 
                                          3123260, 3857501, 201089, 1543200, 
                                          1557200, 1563000, 1569800, 167889600),
                                  end= c(967754, 2010997, 2496804, 3075969, 
                                         3123360, 3857601, 201089, 1555199,
                                         1560599, 1565199, 1573799, 167893599),
                                  names=paste("Site", 1:12, sep="")),
                  strand="+")

peaks2 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", "3", 
                                     "4", "5", "6", "6", "6", "6", "6", "5"),
                          ranges=IRanges(start=c(967659, 2010898, 2496700, 
                                                 3075866, 3123260, 3857500, 
                                                 96765, 201089, 249670, 307586, 
                                                 312326, 385750, 1549800, 
                                                 1554400, 1565000, 1569400,
                                                 167888600), 
                                         end=c(967869, 2011108, 2496920, 
                                               3076166,3123470, 3857780, 
                                               96985, 201299, 249890, 307796, 
                                               312586, 385960, 1550599, 1560799,
                                               1565399, 1571199, 167888999), 
                                         names=paste("t", 1:17, sep="")),
                          strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-", 
                                   "-", "-", "-", "+", "+", "+", "+", "+"))

ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap=1000)
peaklist <- ol$peaklist
```

The function `findOverlapsOfPeaks` returns an object of **overlappingPeaks**, 
which contains there elements: venn_cnt, peaklist (a list consists of all 
overlapping peaks or unique peaks), and overlappingPeaks (a list of data frame 
consists of the annotation of all the overlapping peaks). 

Within the overlappingPeaks element of the **overlappingPeaks** object ol (which is also a list), the element 
"peaks1///peaks2" is a data frame representing the overlapping peaks with maximum gap of 1kb between the two peak 
lists. Using the overlapFeature column in this data frame, a pie graph can be generated to describe the distribution of the features of the 
relative positions of peaks1 to peaks2 for the overlapping peaks.

```{r overlappingPeaks4,fig.cap="Pie chart of common peaks among features."}
overlappingPeaks <- ol$overlappingPeaks
names(overlappingPeaks)
dim(overlappingPeaks[["peaks1///peaks2"]])
overlappingPeaks[["peaks1///peaks2"]][1:2, ]
pie1(table(overlappingPeaks[["peaks1///peaks2"]]$overlapFeature))
```

The following code returns the merged overlapping peaks from the peaklist object.


```{r overlappingPeaks5}
peaklist[["peaks1///peaks2"]]
```


The peaks in peaks1 but not overlap with the peaks in peaks2 can be obtained with: 


```{r 6}
peaklist[["peaks1"]]
```


The peaks in peaks2 but not overlap with the peaks in peaks1 can be obtained with:


```{r 7}
peaklist[["peaks2"]]
```

Venn diagram can be generated by the function `makeVennDiagram` using the output 
of `findOverlapsOfPeaks` as an input.

The `makeVennDiagram` also outputs p-values indicating whether the overlapping is significant.

```{r findOverlapsOfPeaks8,fig.cap="venn diagram of overlaps",fig.width=6,fig.height=6}
makeVennDiagram(ol, totalTest=1e+2)
```

Alternatively,  users have the option to use other tools to plot Venn diagram. The following code demonstrates how to use a third party R package **Vernerable** with the output from the function `findOverlapsOfPeaks`. 


```{r VennerableFigure}
#     install.packages("Vennerable", repos="http://R-Forge.R-project.org", 
#                     type="source")
#     library(Vennerable)
#     venn_cnt2venn <- function(venn_cnt){
#         n <- which(colnames(venn_cnt)=="Counts") - 1
#         SetNames=colnames(venn_cnt)[1:n]
#         Weight=venn_cnt[,"Counts"]
#         names(Weight) <- apply(venn_cnt[,1:n], 1, paste, collapse="")
#         Venn(SetNames=SetNames, Weight=Weight)
#     }
# 
#     v <- venn_cnt2venn(ol$venn_cnt)
#     plot(v)
```


The `findOverlapsOfPeaks` function accepts 
up to 5 peak lists for overlapping peaks.  The following code is an example for 3 peak lists.

```{r findOverlapsOfPeaks9,fig.cap="venn diagram of overlaps for three input peak lists",fig.width=6,fig.height=6}
peaks3 <- GRanges(seqnames=c("1", "2", "3", "4", "5", 
                             "6", "1", "2", "3", "4"),
                   ranges=IRanges(start=c(967859, 2010868, 2496500, 3075966,
                                          3123460, 3851500, 96865, 201189, 
                                          249600, 307386),
                                  end= c(967969, 2011908, 2496720, 3076166,
                                         3123470, 3857680, 96985, 201299, 
                                         249890, 307796),
                                  names=paste("p", 1:10, sep="")),
                  strand=c("+", "+", "+", "+", "+", 
                           "+", "-", "-", "-", "-"))

ol <- findOverlapsOfPeaks(peaks1, peaks2, peaks3, maxgap=1000, 
                          connectedPeaks="min")
makeVennDiagram(ol, totalTest=1e+2)
```

The parameter _totalTest_ in the function `makeVennDiagram` indicates how many 
potential peaks in total will be used in the hypergeometric test. It should be 
larger than the largest number of peaks in the replicates. The smaller it is 
set, the more stringent the test is. The time used to calculate p-value does not
depend on the value of the totalTest. For practical guidance on how to choose _totalTest_, 
please refer to the [post](https://stat.ethz.ch/pipermail/bioconductor/2010-November/036540.html).
We also implement a `permTest` function, in which the number of totalTest is not required. 
For more details about the `permTest`, go to section [**4.11**](#section410). 


## Generate annotation data
One main function of the **ChIPpeakAnno** package is to annotate the positional 
relationship between the peaks and the known genomic features, such as TSS, 5'UTR, 3'UTR etc. 
Constructing and choosing the appropriate annotation data is crucial for this 
process. 

To simplify this process, we precompiled the annotation data for the 
transcriptional starting sites (TSS) of various species (with 
different genome assembly versions). Those TSS annotations include 
TSS.human.NCBI36, TSS.human.GRCh37, TSS.human.GRCh38, TSS.mouse.NCBIM37, 
TSS.mouse.GRCm38, TSS.rat.RGSC3.4, TSS.rat.Rnor\_5.0, TSS.zebrafish.Zv8, and 
TSS.zebrafish.Zv9. Those precompiled annotations can be loaded by R `data()` function.

To annotate the peaks with other genomic features, we provide the function
`getAnnotation` with the argument _featureType_, e.g., "Exon" to obtain 
 the nearest exon, "miRNA" to find the nearest miRNA, and "5utr" or
 "3utr" to locate the overlapping "5'UTR" or "3'UTR". Another parameter for 
`getAnnotation` is the name of the appropriate biomaRt dataset, for example, 
drerio\_gene\_ensembl for zebrafish genome, mmusculus\_gene\_ensembl for mouse 
genome and rnorvegicus\_gene\_ensembl for rat genome. For a list of available 
biomaRt and dataset, please refer to the **biomaRt** package documentation[@Durinck2005].
For the detailed usage of `getAnnotation`, you can
type ?`getAnnotation` in R.

You can also determine to use the biologically appropriate database that is 
related with your biological questions. For example, if you have a list of 
transcription factor binding sites from literatures and are interested in 
locating the nearest TSS and the distance to it for the peak lists. 
You can pass the custom annotation dataset into the function 
`annotatePeakInBatch` as GRanges. Use `toGRanges` function for conversion if necessary. 

To facilitate the creation and documentation of the annotation data, 
we implement an `annoGR` class, which is an extension of GRanges class, to represent 
the annotation data. An `annoGR` object can be constructed from EnsDb, TxDb, or the user defined 
GRanges object by calling the `annoGR` function. 
The advantage of this class is that it contains the meta data 
such as the source and the timestamp (date) of the data source. 
Use ?`annoGR` for more information.

One sample use case is that you want to annotate only the known genes, not other transcript 
products, such as pseudo genes.  A code snippet to build this annotation data using _TranscriptDb_
TxDb.Hsapiens.UCSC.hg19.knownGene with annoGR is: 

```{r annoGRgene}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- annoGR(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")
info(annoData)
annoData
```

## Find the nearest feature and the distance to the feature for the peaklists

With the annotation data, you can annotate the peaks identified from 
ChIP-seq or ChIP-chip experiments to retrieve the nearest gene and distance to 
the corresponding TSS of the gene. 

For example, using the `annoGR` object generated
in previous section as AnnotationData, the first 6 peaks in the myPeakList are 
annotated with the following code:

```{r annotatePeakInBatchByannoGR}
data(myPeakList)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
                                     AnnotationData = annoData)
annotatedPeak[1:3]
```

As discussed in the previous 
section, all the genomic locations of the human genes have been precompiled, such as 
TSS.human.NCBI36 dataset, using function `getAnnotation`. You can pass it
to the argument _annotaionData_ of the `annotatePeakInBatch` function.


```{r annotatePeakInBatch1}
data(TSS.human.NCBI36)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6], 
                 AnnotationData=TSS.human.NCBI36)
annotatedPeak[1:3]
```

You can also pass the user defined features as annotationData. 
A pie chart can be plotted to show the peak distribution among the
features after annotation.


```{r annotatePeakInBatch2}
myPeak1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", 
                              "2", "6", "6", "6", "6", "5"),
                   ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869, 
                                          3123260, 3857501, 201089, 1543200, 
                                          1557200, 1563000, 1569800, 167889600),
                                  end= c(967754, 2010997, 2496804, 3075969, 
                                         3123360, 3857601, 201089, 1555199,
                                         1560599, 1565199, 1573799, 167893599),
                                  names=paste("Site", 1:12, sep="")))

TFbindingSites <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", 
                                     "3", "4", "5", "6", "6", "6", "6", "6",
                                     "5"),
                          ranges=IRanges(start=c(967659, 2010898, 2496700, 
                                                 3075866, 3123260, 3857500, 
                                                 96765, 201089, 249670, 307586, 
                                                 312326, 385750, 1549800, 
                                                 1554400, 1565000, 1569400,
                                                 167888600), 
                                         end=c(967869, 2011108, 2496920, 
                                               3076166,3123470, 3857780, 
                                               96985, 201299, 249890, 307796, 
                                               312586, 385960, 1550599, 1560799,
                                               1565399, 1571199, 167888999), 
                                         names=paste("t", 1:17, sep="")),
                          strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-", 
                                   "-", "-", "-", "+", "+", "+", "+", "+"))

annotatedPeak2 <- annotatePeakInBatch(myPeak1, AnnotationData=TFbindingSites)
annotatedPeak2[1:3]
```
```{r pie1,fig.cap="Pie chart of peak distribution among features",fig.width=5,fig.height=5}
pie1(table(as.data.frame(annotatedPeak2)$insideFeature))
```

Another example of user specific _AnnotationData_ is to annotate peaks by promoters, defined
as upstream 5K to downstream 500bp from TSS. 
The sample code here demonstrates using the `GenomicFeatures::promoters` function to build a 
custom annotation dataset and annotate the peaks with this user defined promoter annotations.

```{r annotatedPromoter}
library(ChIPpeakAnno)
data(myPeakList)
data(TSS.human.NCBI36)
annotationData <- promoters(TSS.human.NCBI36, upstream=5000, downstream=500)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6,], 
                                     AnnotationData=annotationData,
                                     output="overlapping")
annotatedPeak[1:3]
```


In the function `annotatyePeakInBatch`, various parameters can be adjusted to 
specify the way to calculate the distance and how the features are selected. For
example, _PeakLocForDistance_ is to specify the location of the peak for distance 
calculation: "middle" (recommended) means using the middle of the peak, and 
"start" (default, for backward compatibility) means using the start of the peak 
to calculate the distance to the features. Similarly, _FeatureLocForDistance_ is to 
specify the location of the feature for distance calculation: "middle" means using 
the middle of the feature, "start" means using the start of the feature to calculate
the distance from the peak to the feature;  "TSS" (default) means using the 
start of the feature when the feature is on plus strand and using the end of 
feature when the feature is on minus strand; "geneEnd" means using end of the 
feature when feature is on plus strand and using start of feature when feature 
is on minus strand. 

The argument "output" specifies the characteristics of the output of the 
annotated features. The default is "nearestLocation", which means to output the 
nearest features calculated as PeakLocForDistance-FeatureLocForDistance; 
"overlapping" will output the overlapping features within the maximum gap specified as 
maxgap between the peak range and feature range; "shortestDistance" will output the 
nearest features; "both" will output all the nearest features, in addition, will
output any features that overlap the peak that are not the nearest features. 
other options see ?annotatePeakInBatch.

## Find the overlapping and flanking features

In addition to annotate peaks to nearest genes, **ChIPpeakAnno** can also reports all 
overlapping and flanking genes by setting output="both" and maxgap in `annotatePeakInBatch`. 
For example, it outputs all overlapping and flanking genes within 5kb plus nearest genes 
if set maxgap = 5000 and output ="both".
```{r}
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
                                     AnnotationData = annoData,
                                     output="both", maxgap=5000)
head(annotatedPeak)
```


## Add other feature IDs to the annotated peaks

Additional annotations features such as entrez ID, gene symbol and gene name can be added
with the function `addGeneIDs`. The annotated peaks can be saved as an Excel file 
or plotted for visualizing the peak distribution relative to the genomic 
features of interest. Here is an example to add gene symbol to the annotated peaks.
Use ?`addGeneIDs` for more information.


```{r addGeneIDs18}
data(annotatedPeak)
library(org.Hs.eg.db)
addGeneIDs(annotatedPeak[1:6], orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"))
addGeneIDs(annotatedPeak$feature[1:6], orgAnn="org.Hs.eg.db", 
           IDs2Add=c("symbol"))
```


## Obtain the sequences surrounding the peaks

Here is an example to get the sequences of the peaks plus 20 bp upstream and downstream 
for PCR validation or motif discovery.


```{r getAllPeakSequence12}
peaks <- GRanges(seqnames=c("NC_008253", "NC_010468"),
                 ranges=IRanges(start=c(100, 500), 
                                end=c(300, 600), 
                                names=c("peak1", "peak2")))
library(BSgenome.Ecoli.NCBI.20080805)
peaksWithSequences <- getAllPeakSequence(peaks, upstream=20, 
                                         downstream=20, genome=Ecoli)
```


The obtained sequences can be converted to fasta format for motif 
discovery by calling the function `write2FASTA`.


```{r write2FASTA13}
write2FASTA(peaksWithSequences,"test.fa")
``` 


## Heatmap for given feature/peak ranges

You can easily visualize and compare the binding patterns of raw signals of multiple ChIP-Seq experiments using function 
`featureAlignedHeatmap` and `featureAlignedDistribution`.


```{r heatmap,fig.cap="Heatmap of aligned features",fig.width=4,fig.height=6}
path <- system.file("extdata", package="ChIPpeakAnno")
files <- dir(path, "broadPeak")
data <- sapply(file.path(path, files), toGRanges, format="broadPeak")
names(data) <- gsub(".broadPeak", "", files)
ol <- findOverlapsOfPeaks(data)
#makeVennDiagram(ol)
features <- ol$peaklist[[length(ol$peaklist)]]
wid <- width(features)
feature.recentered <- feature.center <- features
start(feature.center) <- start(features) + floor(wid/2)
width(feature.center) <- 1
start(feature.recentered) <- start(feature.center) - 2000
end(feature.recentered) <- end(feature.center) + 2000
## here we also suggest importData function in bioconductor trackViewer package 
## to import the coverage.
## compare rtracklayer, it will save you time when handle huge dataset.
library(rtracklayer)
files <- dir(path, "bigWig")
cvglists <- sapply(file.path(path, files), import, 
                       format="BigWig", 
                       which=feature.recentered, 
                       as="RleList")
names(cvglists) <- gsub(".bigWig", "", files)
sig <- featureAlignedSignal(cvglists, feature.center, 
                            upstream=2000, downstream=2000) 
heatmap <- featureAlignedHeatmap(sig, feature.center, 
                                 upstream=2000, downstream=2000,
                                 upper.extreme=c(3,.5,4))
```


```{r distribution,fig.cap="Distribution of aligned features",fig.width=6,fig.height=6}
featureAlignedDistribution(sig, feature.center, 
                           upstream=2000, downstream=2000,
                           type="l")
```


## Output a summary of motif occurrences in the peaks.

Here is an example to search the motifs in the binding peaks. The motif patterns to be searched are saved in the file examplepattern.fa.


```{r summarizePatternInPeaks17}
peaks <- GRanges(seqnames=c("NC_008253", "NC_010468"),
                 ranges=IRanges(start=c(100, 500), 
                                end=c(300, 600), 
                                names=c("peak1", "peak2")))
filepath <- system.file("extdata", "examplePattern.fa", package="ChIPpeakAnno")
readLines(filepath)
library(BSgenome.Ecoli.NCBI.20080805)
summarizePatternInPeaks(patternFilePath=filepath, format="fasta", skip=0L, 
                        BSgenomeName=Ecoli, peaks=peaks)
```


## Obtain the enriched Gene Ontology (GO) terms or reactome/KEGG terms for the genes near the peaks

With the annotated peak data, you can call the function `getEnrichedGO` to obtain a list of 
enriched GO terms. 
For pathway analysis, you can call function `getEnrichedPATH` using reactome or KEGG database.

In the following sample code we used a subset of the annotatedPeak (the first 500 peaks) for demonstration. All annotated peaks should be used in the real situation. 
```{r getEnriched14}
library(org.Hs.eg.db)
over <- getEnrichedGO(annotatedPeak[1:500], orgAnn="org.Hs.eg.db", 
                    maxP=0.01, multiAdj=FALSE, minGOterm=10, 
                    multiAdjMethod="", condense=FALSE)
head(over[["bp"]][, -3])
head(over[["cc"]][, -3])
head(over[["mf"]][, -3])
``` 
Please note that the default setting of _feature\_id\_type_ is "ensembl\_gene\_id". 
If you are using TxDb as annotation data, please try to change it to "entrez_id".

Please also note that **org.Hs.eg.db** is the GO gene mapping for Human, for other 
organisms, please refer to [released organism annotations](http://www.bioconductor.org/packages/release/data/annotation/),
or call function `egOrgMap` to get the name of annotation database.


```{r egOrgMap15}
egOrgMap("Mus musculus")
egOrgMap("Homo sapiens")
```


## Find peaks with bi-directional promoters

Bidirectional promoters are the DNA regions located between the 5' ends 
of two adjacent genes coded on opposite strands. 
The two adjacent genes are transcribed to the opposite directions, 
and often co-regulated by this shared promoter region[@robertson2007]. 
Here is an example to find peaks with bi-directional promoters and 
output the percentage of the peaks near bi-directional promoters.

```{r peaksNearBDP16}
data(myPeakList)
data(TSS.human.NCBI36)
annotatedBDP <- peaksNearBDP(myPeakList[1:10,], 
                             AnnotationData=TSS.human.NCBI36, 
                             MaxDistance=5000,
                             PeakLocForDistance="middle", 
                             FeatureLocForDistance="TSS")
annotatedBDP$peaksWithBDP
c(annotatedBDP$percentPeaksWithBDP, 
  annotatedBDP$n.peaks, 
  annotatedBDP$n.peaksWithBDP)
```


## Perform permutation test to determine if there is an association between two sets of peaks
<a name="section410"></a>

Given two peak lists from two transcript factors (TFs), one can ask whether 
DNA binding sites of the two TFs are correlated. Most tests are based on hypergeometric distribution, 
which require user to input an estimate of the total potential binding sites 
for a given TF. In contrast, `peakPermTest` implemented here is based on permutation test, which 
does not require user to supply an estimate of the total potential binding 
sites. The random peaks in the permutation test are generated using the 
distribution discovered from the input peaks for a given feature type 
(transcripts or exons) and relevant binding positions to the features 
("TSS", "geneEnd"). The width of the random peaks also follows 
the distribution of that of the input peaks.

Following are the sample codes to do the `permTest`:


```{r peakPermTest}
if(interactive()){
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    cds <- unique(unlist(cdsBy(txdb)))
    utr5 <- unique(unlist(fiveUTRsByTranscript(txdb)))
    utr3 <- unique(unlist(threeUTRsByTranscript(txdb)))
    set.seed(123)
    utr3 <- utr3[sample.int(length(utr3), 1000)]
    pt <- peakPermTest(utr3, 
             utr5[sample.int(length(utr5), 1000)], 
             maxgap=500,
             TxDb=txdb, seed=1,
             force.parallel=FALSE) 
    plot(pt)
    ## highly relevant peaks
    ol <- findOverlaps(cds, utr3, maxgap=1)
    pt1 <- peakPermTest(utr3,
             c(cds[sample.int(length(cds), 500)], 
                cds[queryHits(ol)][sample.int(length(ol), 500)]), 
             maxgap=500, 
             TxDb=txdb, seed=1,
             force.parallel=FALSE)
    plot(pt1)
}
```

Alternatively, users can create a pool of peaks representing all potential 
binding sites with associated binding probabilities for random peak sampling 
(see ?`preparePool`). Here is an example to build a human pool of peaks using 
the transcription factor binding site clusters (V3) (see ?`wgEncodeTfbsV3`) 
downloaded from [ENCODE](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegTfbsClustered/wgEncodeRegTfbsClusteredV3.bed.gz) with the HOT spots (?`HOT.spots`) removed. HOT spots are the genomic 
regions with high probability of being bound by many TFs in ChIP-seq experiments[@yip2012]. 
We suggest removing those HOT spots from the peak lists before performing 
permutation test to avoid the overestimation of the association between the two input 
peak lists. Users can also obtain the [ENCODE blacklist](https://sites.google.com/site/anshulkundaje/projects/blacklists) for a given species.
The blacklists were constructed by identifying consistently problematic regions over independent of cell line and type of experiment for each species in the ENCODE and modENCODE datasets[@encode2012integrated]. 
Please note that some of those blacklist may need to be liftover-ed to the correct genome assembly.

```{r peakPermTest1}
if(interactive()){
    data(HOT.spots)
    data(wgEncodeTfbsV3)
    hotGR <- reduce(unlist(HOT.spots))
    removeOl <- function(.ele){
        ol <- findOverlaps(.ele, hotGR)
        if(length(ol)>0) .ele <- .ele[-unique(queryHits(ol))]
        .ele
    }
    temp <- tempfile()
    download.file(file.path("http://hgdownload.cse.ucsc.edu", 
                            "goldenPath", "hg19", "encodeDCC", 
                            "wgEncodeRegTfbsClustered", 
                            "wgEncodeRegTfbsClusteredV3.bed.gz"), temp)
    data <- toGRanges(gzfile(temp, "r"), header=FALSE, format="others", 
                      colNames = c("seqnames", "start", "end", "TF"))
    unlink(temp)
    data <- split(data, data$TF)
    TAF1 <- removeOl(data[["TAF1"]])
    TEAD4 <- removeOl(data[["TEAD4"]])
    pool <- new("permPool", grs=GRangesList(wgEncodeTfbsV3), N=length(TAF1))
    pt <- peakPermTest(TAF1, TEAD4, pool=pool, ntimes=1000)
    plot(pt)
}
```

# Citing ChIPpeakAnno
Please cite `ChIPpeakAnno` in your publication as follows:
```{r citation, echo=FALSE}
citation("ChIPpeakAnno")
```


# Session Info
```{r sessionInfo}
sessionInfo()
```
# References