\documentclass[10pt]{article} \usepackage{natbib} \usepackage{graphicx} \usepackage[text={6in,9in},centering]{geometry} \usepackage{setspace} %%\onehalfspacing \usepackage{hyperref} \usepackage{url} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,width=9,height=11,prefix.string=figs/figs-peakmodel} \setkeys{Gin}{width=0.98\textwidth} \hypersetup{ colorlinks=true, urlcolor=blue, citecolor=blue } \newcommand{\dee}[1]{\ensuremath{\mathrm{d}#1}} \newcommand{\ddee}[1]{\ensuremath{\frac{\mathrm{d}}{\mathrm{d}#1}}} \title{Modeling peak locations in ChIP-Seq data} %% \author{} \begin{document} \bibliographystyle{plainnat} \maketitle \noindent Our goal is to formulate a model for ``peaks'' in a ChIP-Seq experiment. Much of this is purely an intellectual exercise for now. \\ \noindent The following features are desirable in a model: \begin{itemize} \item A background. This need not be constant (could vary by GC content, copy number, etc.) \item Peaks, at unknown locations, with varying enrichment. Potentially, these peaks may be ``overlapping''; that is, two peaks may be very close to each other (even actually overlapping, in the sense that two overlapping binding sites may be active in different cell subpopulations) \end{itemize} \noindent The things we need to model are \begin{itemize} \item the background \item locations of peaks \item varying enrichment \item distribution of fragment lengths \end{itemize} The data we have consists of location and orientation of one end of each read, for several million reads. \section{Proposed model} \begin{itemize} \item Location of peaks: possibilities are \begin{itemize} \item[(a)] Poisson point process with constant rate $\lambda$ \item[(b)] Known: e.g., all EBOX-s \end{itemize} Things are simplified if peak locations are known; the problem is then to find those with high enrichment. \item Enrichment: No way to really know the distribution, but a convenient choice may be the Gamma distribution, because then coverage at (independent) peaks would be Negative Binomial (mixture of Poisson-s with mean distributed as Gamma). We should be able to estimate the parameters of this Negative Binomial by truncating at a certain peak height, assuming that all peaks above a certain height are real. This may not work if peaks are too close, as then we only see the highest peak in an island. \item Fragment lengths: Normal seems as good as anything. \end{itemize} \noindent So, let $Z_1, \cdots, Z_K$ be the peak locations. $p_k \sim \mathrm{Gamma}(\alpha, p)$ is the enrichment of peak $k$. $p_0$ is the background enrichment (expected number of reads covering a random location). Let the observed data be $(X_j, S_j), j = 1, \cdots, n$, where $X_j$ is the location (first sequenced base) and $S_j$ is the orientation (strand) of read $j$. Let $L_j$ be the unknown length of read $j$. Then, \[ P\left((X_j, S_j) = (x, s) ~|~ Z_i \right) ~=~ P(L_j > |x-Z_j|) \] if $x$ is on the right side of $Z_i$, and $0$ otherwise. \section{Truncated Negative Binomial} \noindent The Gamma distribution with parameters shape $\kappa$ and scale $\theta$ has density \[ f(x)= \frac{1}{ \theta^\kappa \Gamma(\kappa) } x^(\kappa-1) e^{-x/\theta} \] for $x >= 0, \kappa > 0$ and $\theta > 0$. The negative binomial distribution with size $n$ and prob $p$ has density \[ p(x) = \frac{\Gamma(x+n)}{ \Gamma(n) \Gamma(x+1) } p^n (1-p)^x \] for $x = 0, 1, 2, ..., n > 0$ and $0 < p \leq 1$. This represents the number of failures which occur in a sequence of Bernoulli trials before a target number of successes is reached. It can also arise as a mixture of Poisson distributions with mean distributed as a Gamma distribution with scale parameter $\theta = (1 - p)/p$ and shape parameter $\kappa = n$. In this model $p = 1 / (1+\theta)$, and the mean is $\kappa \theta = n (1 - p)/p$. \\ \noindent An alternative parametrization (often used in ecology) is by the mean $\mu = \kappa \theta$ and the ``dispersion parameter'' $\sigma = \kappa$. The variance is $\mu + \mu^2/\kappa$ in this parametrization, or $n (1-p)/p^2 = \kappa \theta (1 + \theta)$ in the earlier ones. <>= library(stats4) ## for mle library(chipseq) library(latticeExtra) dtnbinom <- function(k, size, mu, log = FALSE) { const <- pnbinom(k - 0.5, size = size, mu = mu, lower.tail = FALSE, log.p = log) if (log) function(x) { ifelse(x < k, -Inf, dnbinom(x, size = size, mu = mu, log = TRUE) - const) } else function(x) { ifelse(x < k, 0, dnbinom(x, size = size, mu = mu, log = FALSE) / const) } } negllik <- function(data, cutoff = 8) { data <- subset(data, depth >= cutoff) ## data = data.frame(depth = , count = ) function(logsize, logmu) { dfun <- dtnbinom(k = cutoff, size = exp(logsize), mu = exp(logmu), log = TRUE) with(data, -sum(count * dfun(depth))) } } startest <- function(data) { ## naive: use untruncated dist mu <- with(data, sum(depth * count) / sum(count)) ex2 <- with(data, sum(depth^2 * count) / sum(count)) varx <- ex2 - mu^2 size <- (varx - mu) / mu^2 list(logsize = log(size), logmu = log(mu)) } islandDepthSummary <- function(x, g = extendReads(x)) { s <- slice(coverage(g, width = max(end(g))), lower = 1) tab <- table(viewMaxs(s)) ans <- data.frame(depth = as.numeric(names(tab)), count = as.numeric(tab)) ans } load("myodMyo.rda") combinedMyo <- list(cblasts = combineLaneReads(myodMyo[c("1","3","6")]), ctubes = combineLaneReads(myodMyo[c("2","4","7")])) combinedMyoDepth <- summarizeReads(combinedMyo, summary.fun = islandDepthSummary) depthdist.split <- split(combinedMyoDepth, f = list(combinedMyoDepth$lane, combinedMyoDepth$chromosome)) estimatePars <- function(data, cutoff = 8, ...) { minuslogl <- negllik(data = data, cutoff = cutoff) mle(minuslogl, start = startest(data), ...) } collapseChr <- function(data) { tab <- xtabs(count ~ depth, data) data.frame(depth = as.numeric(names(tab)), count = as.numeric(tab)) } @ <<>>= my.cutoff <- 10 subdata <- collapseChr(subset(combinedMyoDepth, lane == "ctubes")) my.parest <- estimatePars(subdata, cutoff = my.cutoff) summary(my.parest) exp(my.parest@coef) dfun <- dtnbinom(k = 10, size = exp(coef(my.parest)["logsize"]), mu = exp(coef(my.parest)["logmu"]), log = FALSE) rootogram(count ~ depth, data = subdata, subset = depth >= 10, dfun = dfun, xlim = c(0, 100)) ## ## try various ## varyPar <- function(u, v) ## { ## pars <- my.parest@coef + c(u, v) * sqrt(diag(my.parest@vcov)) ## dfun <- ## dtnbinom(k = 10, ## size = exp(pars["logsize"]), ## mu = exp(pars["logmu"]), ## log = FALSE) ## rootogram(count ~ depth, ## data = subdata, ## subset = depth >= 10, ## dfun = dfun, ## xlim = c(0, 100)) ## } @ <>= plot(trellis.last.object()) @ Actually, the likelihood profile is pretty weird. <<>>= est <- my.parest@coef est.sd <- sqrt(diag(my.parest@vcov)) f <- 1 g <- expand.grid(logsize = seq(est["logsize"] - f * est.sd["logsize"], est["logsize"] + f * est.sd["logsize"], length.out = 51), logmu = seq(est["logmu"] - f * est.sd["logmu"], est["logmu"] + f * est.sd["logmu"], length.out = 51), negllik = NA_real_) minuslogl <- negllik(data = subdata, cutoff = my.cutoff) for (i in seq_len(nrow(g))) { g$negllik[i] <- minuslogl(g$logsize[i], g$logmu[i]) } wireframe(scale(negllik) + scale(log(negllik)) ~ logsize + logmu, g, shade = TRUE, outer = TRUE, scales = list(relation = "free"), zlab = "") @ <>= plot(trellis.last.object()) @ We could now try to answer some interesting questions. Choose a random peak $Z$. Let its enrichment be $\lambda(Z) \sim \mathrm{Gamma}(\alpha, p)$. The observed coverage at the peak is $C(Z) \sim Poisson(\lambda(Z))$. For a given peak cutoff $k$, we detect $Z$ if $C(Z) \geq k$. For a given enrichment level $\lambda_0$, we are interested in $P(\lambda(Z) > \lambda_0)$ and \begin{eqnarray*} P( C(Z) \geq k ~|~ \lambda(Z) \geq \lambda_0 ) & = & \frac{ P\left( C(Z) \geq k , \lambda(Z) \geq \lambda_0 \right) }{ P\left( \lambda(Z) \geq \lambda_0 \right) } \end{eqnarray*} where \begin{eqnarray*} P\left( C(Z) \geq k , \lambda(Z) \geq \lambda_0 \right) & = & \int_0^\infty P\left( C(Z) \geq k , \lambda(Z) \geq \lambda_0 ~|~ \lambda(Z) = \lambda \right) f(\lambda) \dee{\lambda} \\ & = & \int_{\lambda_0}^\infty P\left( Poi(\lambda) \geq k \right) f(\lambda) \dee{\lambda} \end{eqnarray*} This doesn't have a closed-form solution (I think). But we can simulate: <<>>= par.mu <- exp(my.parest@coef["logmu"]) par.size <- exp(my.parest@coef["logsize"]) ## Gamma shape par.theta <- par.mu / par.size ## Gamma scale nsim <- 1e6 enrich <- rgamma(nsim, shape = par.size, scale = par.theta) peakcov <- rpois(length(enrich), enrich) table(peakcov) @ % Not very useful because too many low enrichment ``peaks''. Not sure how to efficiently get truncated Gamma. <>= ## P(coverage >= cutoff | expected coverage) xyplot(ppois(q = 10, lambda = lambda, lower = FALSE) ~ lambda, data = list(lambda = seq(0, 25, length = 51)), type = "l") @ \section{Likelihood model for single-peak islands} Assume that there is exactly one peak in a given region (perhaps an island, or a promoter). Our goal is to develop a MLE for the peak location. Parameters: \begin{itemize} \item $\theta$: peak location \item $p$: probability that a read is from the background (related to enrichment?) \item $\mu, \sigma$: parameters of fragment length distribution \end{itemize} The data are $(X_i, S_i)$, where $X_i$ is the read start location, and $S_i$ is the strand for read $i$. $X_i$ has a mixture density \[ f(x | s) = p~f_{0}(x) ~+~ (1-p)~f_{1}(x|s) \] where $f_{0}$ is uniform over the region, and $f_{1}(x|s)$ is the strand-specific foreground density, 0 on one side of the peak, and high close to the peak on the other side. Let $L_i$ be the unknown length of read $i$. For the strand with $X > \theta$, $X | L \sim U(\theta, \theta + L)$. Let the density of $L$ be $\psi$. Then, for $x \geq \theta$, \begin{eqnarray*} P(X \leq x) & = & \int_0^\infty P(X \leq x ~|~ L = l) ~ \psi(l) ~ \dee{l} \\ & = & \int_0^x \psi(l) ~ \dee{l} + \int_x^\infty \frac{x}{l} ~ \psi(l) ~ \dee{l} \end{eqnarray*} Some calculation shows that (assuming w.l.g. that $\theta=0$) \[ f_{1}(x) = \ddee{x} P(X \leq x) = \int_x^\infty \frac{x}{l} ~ \psi(l) ~ \dee{l} = E(1/L) - \int_0^x \frac{1}{l} ~ \psi(l) ~ \dee{l} \] If we model $L$ by the Gamma distribution, $f(x)$ is defined in terms of the partial Gamma function. If $L$ is degenerate at $\mu$ (i.e., $\sigma = 0)$, then \[ f_{1}(x) = \frac{1}{\mu} - \frac{1}{\mu} I \{ x \geq \mu \} = \frac{1}{\mu} I \{ x \leq \mu \} \] (i.e., $X \sim U(\theta, \theta + \mu)$). <>= rawData <- function(data = combinedMyo$ctubes, chr, start, end) { data <- data[[chr]] data <- lapply(data, function(x) x[x >= start & x <= end]) data } ## sigma = 0 version loglik <- function(theta, p, mu, sigma = 0, raw.data, width) { if (p < 0 || p > 1 || mu < 0) return(-Inf) f0 <- function(x) 1/width f1.minus <- function(x) ifelse(x >= theta & x <= theta + mu, 1/mu, 0) f1.plus <- function(x) ifelse(x <= theta & x >= theta - mu, 1/mu, 0) sum(log(p * (f0(raw.data[["+"]])) + (1-p) * f1.plus(raw.data[["+"]]))) + sum(log(p * (f0(raw.data[["-"]])) + (1-p) * f1.minus(raw.data[["-"]]))) } negllik.p.mu <- function(theta, sigma = 0, raw.data, width) { function(p, mu) -loglik(theta = theta, p = p, mu = mu, sigma = sigma, raw.data = raw.data, width = width) } doTheta <- function(theta = 83300000, raw.data = rd, width = w) { lfun <- negllik.p.mu(theta = theta, raw.data = raw.data, width = width) m <- mle(lfun, start = list(p = 0.5, mu = 200)) c(M2LL = 2 * m@min, m@coef, sd = sqrt(diag(m@vcov))) } doRegion <- function(data = combinedMyo$ctubes, chr, start, end, plot = interactive(), by = 5) { rd <- rawData(data = combinedMyo$ctubes, chr = chr, start = start, end = end) w <- end - start ## if (plot) plot(stripplot(which ~ data, do.call(make.groups, rd), jitter = TRUE)) doTheta <- function(theta) { lfun <- negllik.p.mu(theta = theta, raw.data = rd, width = w) m <- try(mle(lfun, start = list(p = 0.5, mu = 200)), silent = TRUE) if (inherits(m, "try-error")) rep(NA_real_, 5) else c(M2LL = 2 * m@min, m@coef, sd = sqrt(diag(m@vcov))) } i <- seq(start, end, by = by) ans <- sapply(i, function(theta) { if (interactive()) message(theta) doTheta(theta) }) ans <- as.data.frame(t(ans)) names(ans) <- c("M2LL", "p", "mu", "sd.p", "sd.mu") ans$theta <- i ans$plus <- density(rd[["+"]], from = start, to = end, n = length(i))$y ans$minus <- density(rd[["-"]], from = start, to = end, n = length(i))$y if (plot) { plot(xyplot(M2LL + p + mu + plus + minus ~ theta, data = ans, type = "l", outer = TRUE, scales = list(y = list(relation = "free", rot = 0)), strip = FALSE, strip.left = TRUE, ylab = "", layout = c(1, 5))) } invisible(ans) } ## Some high peaks: ## chromosome start end comb.max ## 30938 chr15 83299674 83300455 505 ## 37239 chr17 31927395 31928306 519 ## 41970 chr18 78063806 78064390 524 ## 42317 chr19 3923319 3924497 507 ## 48815 chr2 102776952 102777617 567 ## 65025 chr5 121038253 121038858 561 ## 80780 chr8 125374395 125375194 619 ## rd <- rawData(data = combinedMyo$ctubes, chr = "chr15", start = 83299674, end = 83300455) ## w <- 83300455 - 83299674 ## stripplot(which ~ data, do.call(make.groups, rd), jitter = TRUE) @ Here are a couple of examples, done with $\sigma = 0$. \newpage <<>>= doRegion(data = combinedMyo$ctubes, chr = "chr17", start = 31927395, end = 31928306, plot = TRUE) @ <>= plot(trellis.last.object()) @ \newpage <<>>= doRegion(data = combinedMyo$ctubes, chr = "chr2", start = 102776952, end = 102777617, plot = TRUE) @ <>= plot(trellis.last.object()) @ \end{document}