%\VignetteIndexEntry{oligo - Primer}
%\VignetteKeywords{expression, exon, gene, tiling, SNP}
%\VignetteDepends{oligoData, maqcExpression4plex, genefilter, limma, RColorBrewer, biomaRt, BSgenome.Hsapiens.UCSC.hg18, GenomeGraphs, ACME}
%\VignettePackage{oligo}
\documentclass[12pt, letterpaper]{article}

%% R stuff to use in LaTeX
\newcommand{\R}{{\textsf{R}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}
\newcommand{\oligo}{\Rpackage{oligo}}

\usepackage{graphicx, setspace, Sweave, amsfonts, amsmath, multirow, hyperref, amssymb, rotating, subfigure}
\usepackage[authoryear,round]{natbib}
\usepackage[margin=1in]{geometry}

%% Location of the figures
\graphicspath{{./figures/}}

%% Double spacing
\doublespacing

%% Sweave stuff
\SweaveOpts{eps=FALSE, prefix.string=figures/fig}
\usepackage{color}
\definecolor{midnightblue}{rgb}{0.098,0.098,0.439}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl, baselinestretch=1, formatcom=\color{midnightblue}}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{baselinestretch=1, formatcom=\color{midnightblue}}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontshape=sl, baselinestretch=1}
\fvset{listparameters={\setlength{\topsep}{0pt}}}

\title{\Rpackage{oligo} - Primer}
\author{Benilton Carvalho}
\date{}

\begin{document}

%% Initializing the R environment
<<>>=
options(prompt="R> ", continue=" ", width=70, device=pdf)
@

\maketitle
\newpage
\tableofcontents
\newpage

\section{Data Import}
\label{sec:dataImport}

To import data using the \Rpackage{oligo} package, the user must have data at the probe level. This means that if Affymetrix data are to be imported, the user is expected to have CEL files; if NimbleGen data are used instead, then XYS files must be available. Once such files are available, the user can use one of two tools, depending on the array manufacturer, to import the data: \Rfunction{read.celfiles} for CEL files, and \Rfunction{read.xysfiles} for XYS files.

To assist the user in obtaining the names of the CEL or XYS files, the package provides two functions, \Rfunction{list.celfiles} and \Rfunction{list.xysfiles}, which accept the same arguments as the \Rfunction{list.files} function defined in the \R{} \Rpackage{base} package. The basic usage of the package tools to import CEL or XYS files present in the current directory consists of combining the \Rfunction{read.\*files} functions with their \Rfunction{list.\*files} counterparts, as shown below in a hypothetical example:

<<eval=FALSE>>=
library(oligo)
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles)
xysFiles <- list.xysfiles()
nimbleRaw <- read.xysfiles(xysFiles)
@

The \Rpackage{oligo} package will attempt to identify the annotation package required to read the data in. If this annotation package is not installed, \Rpackage{oligo} will try to download it from BioConductor. If the annotation is not available on BioConductor, the user should use the \Rpackage{pdInfoBuilder} package to create an appropriate annotation. In case \Rpackage{oligo} fails to identify the annotation package's name correctly, the user can use the \Rfunarg{pkgname} argument available for both \Rfunction{read.celfiles} and \Rfunction{read.xysfiles}.
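As an illustration of this fallback, the hypothetical snippet below (not evaluated) forces the use of a particular annotation package; the package names shown are only illustrative and must match the actual array designs used.

<<eval=FALSE>>=
## Hypothetical example: forcing a specific annotation package.
## The package names below are illustrative and must correspond to the design.
affyRaw <- read.celfiles(celFiles, pkgname="pd.hg.u95a")
nimbleRaw <- read.xysfiles(xysFiles, pkgname="pd.hg18.60mer.expr")
@
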
From this point on, this document provides examples of the usage of the \Rpackage{oligo} package using datasets available in the \Rpackage{oligoData} package.

\begin{table}[!htp]
  \centering
  \begin{minipage}{1.0\linewidth}
    \begin{tabular}{|c|c|}
      \hline
      Object Name & Description \\
      \hline
      affyExpressionFS & Latin Square - Affymetrix U95A \\
      nimbleExpressionFS & Sample Expression Dataset - NimbleGen \\
      affyExonFS & Exon Sample Dataset - Human \\
      affySnpFS & HapMap samples on XBA Array \\
      nimbleTilingFS & Sample ChIP-chip dataset \\
      \hline
    \end{tabular}
    \caption{Datasets used in this document.}
    \label{tab:datasets}
  \end{minipage}
\end{table}

\section{Preprocessing Expression Arrays}
\label{sec:expressionApp}

\subsection{Affymetrix Expression}
\label{sec:affyExpr}

The dataset used in this example corresponds to the Latin Square Data for Expression Algorithm Assessment on the Human Genome U95 platform, made available by Affymetrix on their website\footnote{\url{http://www.affymetrix.com/support/technical/sample_data/datasets.affx}}. To be used with \Rpackage{oligo}, these data require the availability of the \Rpackage{pd.hg.u95a} annotation package, built with the \Rpackage{pdInfoBuilder} package.

After the annotation package is installed, the next step is to load \Rpackage{oligo} and identify the files to be used in the analysis. The \Rfunction{list.celfiles} function can be used to appropriately list Affymetrix CEL files. If CEL files were available in a directory called \Robject{expressionData}, then, in the snippet below, \Robject{celFiles} would contain the CEL filenames, including the full path.

<<eval=FALSE>>=
library(oligo)
celFiles <- list.celfiles("expressionData", full.names=TRUE)
@

Importing the CEL files is achieved with the \Rfunction{read.celfiles} function. The function will, in general, correctly identify the annotation package to be used with the experimental data being imported, but the user can specify the \Rfunarg{pkgname} argument to force the use of a particular one, if for some reason this is required. Note that the snippet below corresponds to a hypothetical example, in which we would read CEL files saved in a directory called \Robject{expressionData}.

<<eval=FALSE>>=
affyExpressionFS <- read.celfiles(celFiles, pkgname="pd.hg.u95a")
@

In reality, the \Robject{affyExpressionFS} object is already available in the \Rpackage{oligoData} package. The example below demonstrates how it can be loaded.

<<>>=
library(oligoData)
data(affyExpressionFS)
@

The object \Robject{affyExpressionFS} belongs to the \Rclass{ExpressionFeatureSet} class, as it corresponds to expression data. The object, like all \Rclass{FeatureSet}-like objects, represents features in the rows and samples in the columns and can be easily subsetted using the standard \Robject{$[$} operator. All the manipulation structure is inherited through the tight integration between \Rpackage{oligo} and \Rpackage{Biobase}, whose documentation we recommend to the interested reader.

<<>>=
class(affyExpressionFS)
@

Figure~\ref{fig:imgAffyExp} demonstrates how the \Rmethod{image} method can be used to generate pseudo-images of the samples. In this particular plot, we use the first sample as an example and a grayscale palette for plotting.
<<>>=
image(affyExpressionFS, 1, col=gray((64:0)/64))
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-imageAffyExpression}
  \caption{Pseudo-image, used for visual assessment of the array, for sample \Robject{1521a99hpp\_av06.CEL}.}
  \label{fig:imgAffyExp}
\end{figure}

The user can evaluate the distribution of the observed data by using the \Rmethod{hist} method, which will produce smoothed histograms for each sample available in the dataset. Before plotting, the method transforms the data using the function passed to the \Rfunarg{transfo} argument, whose default is \Rfunction{log2}, explaining why the plot is shown on the $\log_2$ scale.

<<>>=
hist(affyExpressionFS)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-histRawAffyExpression}
  \caption{Smoothed histograms for samples in the dataset.}
  \label{fig:histAffyExp}
\end{figure}

Another approach to assess the data distribution is to use the \Rmethod{boxplot} method. On \Rclass{FeatureSet} objects, the method will automatically transform the data to the $\log_2$ scale, but this is easily modified through the \Rfunarg{transfo} argument, which takes a function as a valid value.

<<>>=
boxplot(affyExpressionFS)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-boxplotRawAffyExpression}
  \caption{Boxplot showing the distribution of the observed $\log_2$-intensities on the sample dataset. The \Rfunction{boxplot} method implemented in \Rpackage{oligo} follows the standards of the original method used by \R{}.}
  \label{fig:bpAffyExp}
\end{figure}

Plotting log-ratios \textit{versus} average intensities can often reveal intensity effects on log-ratios, as shown by the MA plot in Figure~\ref{fig:maplotAffyExpression}. The \Rfunarg{which} argument can be specified to determine which samples will be plotted, and the \Rfunarg{lowessPlot} argument is a logical flag indicating whether a lowess curve should be overlaid on the data points.

<<>>=
MAplot(affyExpressionFS, which=1, ylim=c(-1, 1))
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-maplotAffyExpression}
  \caption{The MA plot can be used to assess the dependence of log-ratios on average log-intensities.}
  \label{fig:maplotAffyExpression}
\end{figure}

The annotation packages used by \Rpackage{oligo} store feature sequences. This is done through instances of \Rclass{DNAStringSet} objects implemented in the \Rpackage{Biostrings} package. The sequences for PM probes can be easily accessed via the \Rfunction{pmSequence} function, as shown below.

<<>>=
pmSeq <- pmSequence(affyExpressionFS)
pmSeq[1:5]
@

When importing the data, \Rpackage{oligo} does not impose any transformation, so one needs to manually apply, for example, the $\log_2$ transform to the intensities of PM probes, which can be accessed with the \Rfunction{pm} function, as needed. Below, we show how to obtain the $\log_2$ PM intensities for each sample in \Robject{affyExpressionFS}.

<<>>=
pmsLog2 <- log2(pm(affyExpressionFS))
@

The dependence of intensity on probe sequence is a well-established fact in the microarray literature. The use of the \Rpackage{oligo} package significantly simplifies the observation of this effect, as it provides simple access to both observed intensities and annotation. Below, we estimate the affinity spline coefficients \citep{Wu:2004p1634}.
<<>>=
coefs <- getAffinitySplineCoefficients(pmsLog2, pmSeq)
@

In Figure~\ref{fig:seqEffectAffyExpression}, we show how the results above can be used to estimate the base-position effects on the $\log_2$-intensities observed for the first sample in the dataset. The \Rfunction{getBaseProfile} function provides a simple way of using the affinity coefficients to estimate the effects of interest. It accepts a logical \Rfunarg{plot} argument to produce the plot and returns, invisibly, the estimated effects. All the arguments that can be passed to the \Rfunction{matplot} function can also be passed to \Rfunction{getBaseProfile}.

<<>>=
colors <- darkColors(4)
xL <- "Base Position"
yL <- "Effect"
pchs <- c("A", "C", "G", "T")
getBaseProfile(coefs[,1], plot=TRUE, pch=pchs, type="b", xlab=xL,
               ylab=yL, lwd=3, col=colors, ylim=c(-.4, .4))
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-seqEffectAffyExpression}
  \caption{Sequence effect for the first sample in the dataset. These results have been reported in detail elsewhere, but can be easily reproduced with the use of the \Rpackage{oligo} package.}
  \label{fig:seqEffectAffyExpression}
\end{figure}

Tools implemented in other packages can be used in conjunction with \Rpackage{oligo} to investigate different hypotheses. The example below shows how the \Rfunction{alphabetFrequency} function, defined in the \Rpackage{Biostrings} package, can be used to determine the GC content of the probe sequences accessed by \Rpackage{oligo}.

<<>>=
counts <- Biostrings::alphabetFrequency(pmSeq, baseOnly=TRUE)
GCcontent <- ordered(counts[, "G"]+counts[, "C"])
@

In addition to Figure~\ref{fig:seqEffectAffyExpression}, we can also plot the $\log_2$-intensities as a function of the GC content computed above. Figure~\ref{fig:gcBpAffyExpression} presents the strong dependency of $\log_2$-intensities on GC content for sample~1, which is also present in all other samples.

<<>>=
colors <- seqColors(nlevels(GCcontent))
xL <- "GC Frequency in 25-mers"
yL <- expression(log[2]~intensity)
boxplot(pmsLog2[,1]~GCcontent, xlab=xL, ylab=yL, range=0, col=colors)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-intensityByContentAffyExpression}
  \caption{In this boxplot stratified by GC content, we can observe the strong dependency of $\log_2$-intensities on the number of G or C bases observed in the probe sequence.}
  \label{fig:gcBpAffyExpression}
\end{figure}

To preprocess expression data, \Rpackage{oligo} implements the RMA algorithm \citep{Irizarry2003a, Irizarry2003b}. The \Rmethod{rma} method, as shown below, proceeds with background subtraction, normalization and summarization using median polish.

<<>>=
ppData <- rma(affyExpressionFS)
@

The results are returned as an \Rclass{ExpressionSet} instance and can be used in downstream analyses, as done by several strategies for microarray data analysis described elsewhere.

<<>>=
class(ppData)
@

At this point, the user can proceed with, for example, differential expression analyses. The methodologies involved in this step make use of several other packages, like \Rpackage{limma} and \Rpackage{genefilter}. When preprocessing the data, \Rpackage{oligo} stores the summaries in a matrix called \Robject{exprs}, present in the \Robject{assayData} slot of the \Rclass{ExpressionSet} object. Therefore, the only requirement for additional strategies applied to the preprocessed data is that the processed values be accessed with the \Rmethod{exprs} method.
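As a minimal illustration of this access pattern, the sketch below (not evaluated) extracts the RMA summaries from \Robject{ppData} and feeds them to \Rpackage{limma}; the \Robject{group} factor is hypothetical and stands for whatever covariate describes the experimental layout.

<<eval=FALSE>>=
## Sketch only: extract the RMA summaries and fit a hypothetical two-group
## comparison with limma.  'group' is an assumed factor describing the samples.
summaries <- exprs(ppData)
library(limma)
design <- model.matrix(~group)
fit <- eBayes(lmFit(summaries, design))
topTable(fit, coef=2)
@
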
\subsection{NimbleGen Expression}
\label{sec:nimblExpr}

This section presents a non-trivial use of the \oligo{} package for the analysis of NimbleGen expression data. This vignette follows the structure of the chapter \textbf{From CEL files to a list of interesting genes} by R. A. Irizarry \textit{in} \textit{Bioinformatics and Computational Biology Solutions Using R and Bioconductor}, which shows a case study for Affymetrix expression arrays.

In order to analyze microarray data using \oligo{}, the user is expected to have installed on the system a package with the annotation for the particular array design on which the experiment was performed. For the example in question here, the design is hg18\_60mer\_expr and the annotation package associated with it is \Rpackage{pd.hg18.60mer.expr}, which is built by using the \Rpackage{pdInfoBuilder} package.

\subsubsection{Initialization of the environment}
\label{sec:initNimblExpr}

In this particular example, we will read XYS files instead of loading the \Rclass{FeatureSet} object already available through the \Rpackage{oligoData} package (the \Robject{maqc} object that we will create below is exactly the \Robject{nimbleExpressionFS} data object provided by the \Rpackage{oligoData} package).

We start by loading the packages that are going to be used in this session. The \Rpackage{maqcExpression4plex} package provides a set of six samples from the MAQC study; the set comprises samples from two groups: universal reference and brain. The remaining packages offer additional functionality, like tools for filtering, plotting and visualization.

<<>>=
library(oligo)
library(maqcExpression4plex)
library(genefilter)
library(limma)
@

Once the packages are loaded, we can easily get the location of the XYS files that contain the intensities by calling \Rfunction{list.xysfiles}, which takes the same arguments as \Rfunction{list.files}. To minimize the chance of problems, we strongly recommend the use of \Rcode{full.names=TRUE}.

<<>>=
extdata <- system.file("extdata", package="maqcExpression4plex")
xys.files <- list.xysfiles(extdata, full.names=TRUE)
basename(xys.files)
@ %def

To read the XYS files, we provide the \Rfunction{read.xysfiles} function, which also takes \Robject{phenoData}, \Robject{experimentData} and \Robject{featureData} objects and returns an appropriate subclass of \Rclass{FeatureSet}.

<<>>=
theData <- data.frame(Key=rep(c("brain", "universal reference"), each=3))
rownames(theData) <- basename(xys.files)
lvls <- c("channel1", "channel2", "_ALL_")
vMtData <- data.frame(channel=factor("_ALL_", levels=lvls),
                      labelDescription="Sample type")
pd <- new("AnnotatedDataFrame", data=theData, varMetadata=vMtData)
maqc <- read.xysfiles(xys.files, phenoData=pd)
class(maqc)
@ %def

\subsubsection{Exploring the feature-level data}
\label{sec:exploring}

The \Rfunction{read.xysfiles} function returns, in this case, an instance of \Rclass{ExpressionFeatureSet}, and the intensities read from these files are stored in its \Robject{exprs} slot, which can be accessed with a method of the same name.

<<>>=
exprs(maqc)[10001:10010, 1:2]
@ %def

The \Rmethod{boxplot} method can be used to produce boxplots for the feature-level data.

<<>>=
boxplot(maqc, main="MAQC Sample Data")
@ %def

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-maqcBoxplot}
  \caption{Distribution of $\log_2$-intensities of samples on the MAQC dataset.}
  \label{fig:maqcBp}
\end{figure}

Similarly, a smoothed histogram for the feature-level data can be obtained with the \Rmethod{hist} method.
<<>>=
hist(maqc, main="MAQC Sample Data")
@ %def

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-maqcHist}
  \caption{Smoothed histogram of $\log_2$-intensities of samples on the MAQC dataset.}
  \label{fig:maqcHist}
\end{figure}

\subsubsection{RMA algorithm}
\label{sec:rma}

The RMA algorithm can be applied to the raw data of expression arrays. It is available via the \Rmethod{rma} method. The algorithm will perform background subtraction, quantile normalization and summarization via median polish. The result of \Rmethod{rma} is an instance of the \Rclass{ExpressionSet} class, which also contains an \Robject{exprs} slot and method.

<<>>=
eset <- rma(maqc)
class(eset)
show(eset)
exprs(eset)[1:10, 1:2]
@ %def

The \Rmethod{boxplot} and \Rmethod{hist} methods are also implemented for \Rclass{ExpressionSet} objects. Note that \Rmethod{rma}'s output is already in the $\log_2$ scale, so we call these methods with the argument \Rcode{transfo=identity}, ensuring the data are not transformed any further.

<<>>=
boxplot(eset, transfo=identity, main="After RMA")
@ %def

<<>>=
hist(eset, transfo=identity, main="After RMA")
@ %def

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-rmaResultsB}
  \includegraphics[width=.45\textwidth]{fig-rmaResultsH}
  \caption{Boxplot and smoothed histogram for MAQC data after preprocessing.}
  \label{fig:rmaResults}
\end{figure}

\subsubsection{Assessing differential expression}
\label{sec:diffexp}

One simple approach to assess differential expression is to flag units with log-ratios greater (in absolute value) than 1, i.e., a change greater than 2-fold when comparing brain vs. universal reference.

<<>>=
e <- exprs(eset)
index <- which(eset[["Key"]] == "brain")
d <- rowMeans(e[, index])-rowMeans(e[, -index])
a <- rowMeans(e)
sum(abs(d)>1)
@

Another approach is to use $t$-tests to infer whether or not there is differential expression.

<<>>=
tt <- rowttests(e, factor(eset[["Key"]]))
lod <- -log10(tt[["p.value"]])
@

The MA plot can be used to visualize the behavior of the log-ratio as a function of average log-intensity. Features with log-ratios greater (in absolute value) than 1 are candidates for being classified as differentially expressed.

<<>>=
smoothScatter(a, d, xlab="Average Intensity", ylab="Log-ratio",
              main="MAQC Sample Data")
abline(h=c(-1, 1), col=2)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-maplot}
  \caption{MA plot for Brain vs. Universal Reference. The red lines show the threshold for a fold-change of 2, up or down, which corresponds to log-fold-changes of $1$ and $-1$, respectively.}
  \label{fig:maplot}
\end{figure}

The use of $t$-tests allows us to use the volcano plot to visualize candidates for differential expression. Below, we highlight, in blue, the top 25 in log-ratio and, in red, the top 25 in $t$-statistic.

<<>>=
o1 <- order(abs(d), decreasing=TRUE)[1:25]
o2 <- order(abs(tt[["statistic"]]), decreasing=TRUE)[1:25]
o <- union(o1, o2)
smoothScatter(d, lod, main="A Better view")
points(d[o1], lod[o1], pch=18, col="blue")
points(d[o2], lod[o2], pch=1, col="red")
abline(h=2, v=c(-1, 1), col=2)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-volcanoplot2}
  \caption{Volcano plot for Brain vs. Universal Reference. The vertical red lines show the threshold for a fold-change of 2 (up or down), while the horizontal red line shows the threshold for p-values at the $10^{-2}$ level. The probesets shown in solid blue diamonds are the top-25 probesets for log-ratio.
The probesets highlighted in red are the top-25 in p-value.}
  \label{fig:volcano}
\end{figure}

The \Rpackage{limma} package can also be used to assess differences in expression between the two groups.

<<>>=
design <- model.matrix(~factor(eset[["Key"]]))
fit <- lmFit(eset, design)
ebayes <- eBayes(fit)
lod <- -log10(ebayes[["p.value"]][,2])
mtstat <- ebayes[["t"]][,2]
@

The Empirical Bayes approach implemented in \Rpackage{limma} provides moderated $t$-statistics, shown to have better performance than the standard $t$-statistic. Below, we reconstruct the volcano plot using the moderated $t$-statistic.

<<>>=
o1 <- order(abs(d), decreasing=TRUE)[1:25]
o2 <- order(abs(mtstat), decreasing=TRUE)[1:25]
o <- union(o1, o2)
smoothScatter(d, lod, main="Moderated t", xlab="Log-ratio", ylab="LOD")
points(d[o1], lod[o1], pch=18, col="blue")
points(d[o2], lod[o2], pch=1, col="red")
abline(h=2, v=c(-1, 1), col=2)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-volcanoplot3}
  \caption{Volcano plot for Brain vs. Universal Reference using moderated t-tests. The vertical red lines show the threshold for a fold-change of 2 (up or down), while the horizontal red line shows the threshold for p-values at the $10^{-2}$ level. The probesets shown in solid blue diamonds are the top-25 probesets for log-ratio. The probesets highlighted in red are the top-25 in p-value (for the moderated t-test). Note that there is more overlap between the top-25 for log-ratio and the top-25 for p-value.}
  \label{fig:volcanoMod}
\end{figure}

The \Rmethod{topTable} command provides a way of ranking genes for further evaluation. In the example below, we adjust for multiple testing using FDR and look at the top 10 genes.

<<>>=
tab <- topTable(ebayes, coef=2, adjust="fdr", n=10)
tab
@

\section{Obtaining Genotype Calls from SNP Arrays}
\label{sec:genotyping}

The \Rpackage{oligo} package can genotype, using the CRLMM algorithm, several Affymetrix SNP arrays. To do so, the user will need, in addition to the \Rpackage{oligo} package, an annotation data package specific to the design used in the experiment. Although these annotation packages are created using the \Rpackage{pdInfoBuilder} package, the CRLMM algorithm requires additional hand-curated data, which are included in the packages made available through the BioConductor website. Table~\ref{tab:snpPlatforms} describes the supported designs and the respective annotation packages.

\begin{table}[!htp]
  \centering
  \begin{tabular}{|c|c|}
    \hline
    \textbf{Design} & \textbf{Annotation Package} \\
    \hline
    Mapping 50K XBA & \Rpackage{pd.mapping50k.xba240} \\
    Mapping 50K HIND & \Rpackage{pd.mapping50k.hind240} \\
    Mapping 250K NSP & \Rpackage{pd.mapping250k.nsp} \\
    Mapping 250K STY & \Rpackage{pd.mapping250k.sty} \\
    Genomewide SNP 5.0 & \Rpackage{pd.genomewidesnp.5} \\
    Genomewide SNP 6.0 & \Rpackage{pd.genomewidesnp.6} \\
    \hline
  \end{tabular}
  \caption{SNP array designs currently supported by the \Rpackage{oligo} package and their respective annotation packages. These annotation packages are made available through the BioConductor website and contain hand-curated data, required by the CRLMM algorithm.}
  \label{tab:snpPlatforms}
\end{table}

As an example, we will use the 269 CEL files, on the XBA array, available on the HapMap website\footnote{\url{http://www.hapmap.org}}, which were downloaded and saved, uncompressed, to a subdirectory called \Robject{snpData}. Therefore, we need to instruct the software to look for the files at the correct location.
An output directory should also be defined; this is where the summary files, including genotype calls and confidences, are stored. This output directory, which we chose to call \Robject{crlmmResults}, must not exist prior to the CRLMM call; the software will take care of creating it.

<<>>=
library("oligo")
fullFilenames <- list.celfiles("snpData", full.names=TRUE)
outputDir <- file.path(getwd(), "crlmmResults")
@

Given the ever-increasing density of the SNP arrays, we developed efficient methods to process these chips, reducing the required amount of memory even for large studies. Using this approach, we process batches of SNPs at a time, saving partial results to disk. We refer the interested reader to \citet{Carvalho:2007p32} for detailed information on the CRLMM algorithm.

The genotyping strategy, in summary, has three steps: A) quantile normalize against a known reference distribution; B) summarize the data to the SNP-allele level using median polish; C) use the estimated parameters to classify the samples into genotype groups using the Mahalanobis distance. The summaries are average intensities and log-ratios, defined across alleles and within strand, i.e.:
\begin{eqnarray}
  A_{s} & = & \frac{\theta_{A, s}+\theta_{B, s}}{2} \\
  M_{s} & = & \theta_{A, s} - \theta_{B, s},
\end{eqnarray}
where $s$ denotes the strand (antisense or sense). On the genomewide designs, SNP 5.0 and 6.0, the strand information is dropped. These summaries can be obtained via the \Rmethod{getA} and \Rmethod{getM} methods, which return arrays with dimensions corresponding to SNPs, samples and strands (if applicable), respectively. These measures are later used for genotyping.

CRLMM involves running an EM algorithm to adjust for average intensity and fragment length on the log-ratio scale. These adjustments may take a long time to run, depending on the combination of the number of samples and the computer resources available. Below, we show the simplest way to call CRLMM, which requires only the file names and the output directory.

<<>>=
if (!file.exists(outputDir)) crlmm(fullFilenames, outputDir)
@

The \Rfunction{crlmm} method does not return an object to the R session. Instead, it saves the objects to disk, as not all systems are guaranteed to meet the memory requirements that \Rclass{SnpSuperSet} objects might need. For the user's convenience, the \Rfunction{getCrlmmSummaries} function will read the information from disk and make a \Rclass{SnpCallSetPlus} or \Rclass{SnpCnvCallSetPlus} object available to the user.

<<>>=
crlmmOut <- getCrlmmSummaries(outputDir)
calls(crlmmOut[1:5,1:2])
confs(crlmmOut[1:5,1:2])
@

The genotype calls are represented by 1 (AA), 2 (AB) and 3 (BB). The confidence is the predicted probability that the algorithm made the right call. Summaries generated by the algorithm can also be accessed from the \R{} session. The options for summaries are \Rfunarg{"alleleA"}, \Rfunarg{"alleleB"}, \Rfunarg{"alleleA-sense"}, \Rfunarg{"alleleA-antisense"}, \Rfunarg{"alleleB-sense"} and \Rfunarg{"alleleB-antisense"}. The options \Rfunarg{"alleleA"} and \Rfunarg{"alleleB"} are only available for the SNP 5.0 and SNP 6.0 platforms. The other options are to be used with the 50K and 250K arrays.

Below, we choose two SNPs to show the different configurations of the genotype groups.

<<>>=
snps <- paste("SNP_A-", c(1703121, 1725330), sep="")
LIM <- c(-4, 4)
@

Figure \ref{fig:goodSNP} represents a SNP for which genotyping is simplified by the good discrimination of both strands.
Figure \ref{fig:senseSNP} shows a SNP for which features on the antisense strand have very good discrimination power, while no information (for classification) can be extracted from the sense strand.

<<>>=
gtypes <- as.integer(calls(crlmmOut[snps[1],]))
plotM(crlmmOut, snps[1], ylim=LIM, xlim=LIM, col=gtypes)
@

<<>>=
gtypes <- as.integer(calls(crlmmOut[snps[2],]))
plotM(crlmmOut, snps[2], ylim=LIM, xlim=LIM, col=gtypes)
@

\begin{figure}[!htp]
  \begin{center}
    \subfigure[SNP\_A-1703121 has very good discrimination on both strands and, like competing algorithms, CRLMM has excellent performance in scenarios like this. On this plot, genotype calls provided by oligo are represented in different colors (black: AA; red: AB; green: BB).]{\label{fig:goodSNP}
      \includegraphics[width=.45\textwidth]{fig-goodSNP}}
    \subfigure[SNP\_A-1725330 presents poor discrimination on the sense strand. Because CRLMM does not average across strands, it can perfectly predict the genotype cluster each sample belongs to. In similar scenarios, competing algorithms are known to fail. The color scheme follows Figure~\ref{fig:goodSNP}.]{\label{fig:senseSNP}
      \includegraphics[width=.45\textwidth]{fig-senseSNP}}
  \end{center}
\end{figure}

CRLMM was shown to outperform competing genotyping tools. We refer the reader to \citet{Lin:2008p33} for further details on this subject. The genotypes provided by CRLMM, and in this example stored in \Robject{crlmmOut}, can be easily used with other BioConductor tools, like the \Rpackage{snpStats} package, for downstream analyses.

\section{Preprocessing Exon Arrays}
\label{sec:exon}

In this section, we use colon cancer sample data for exon arrays, available on the Affymetrix website\footnote{\url{http://www.affymetrix.com/support/technical/sample_data/exon_array_data.affx}}, to demonstrate the use of the \Rpackage{oligo} package to preprocess these data. The interested reader can download the CEL files and use \Rfunction{read.celfiles} to import the data. Here, however, we will use the \Rpackage{oligoData} package to load this dataset, as shown below.

<<>>=
library(oligoData)
data(affyExonFS)
@

As already noted, \Rpackage{oligo} implements different classes depending on the nature of the data. Therefore, a quick inspection, as in the snippet below, shows that \Robject{affyExonFS} is an \Rclass{ExonFeatureSet} object. This is an especially interesting feature, as it allows methods to behave differently depending on the object class.

<<>>=
affyExonFS
@

Generally, RMA will background correct, quantile normalize and summarize to the probeset level, as defined in the annotation packages. When working with an \Rclass{ExonFeatureSet} object, processing to the probeset level provides expression summaries at the exon level and can be obtained by setting the argument \Rfunarg{target} to \Robject{"probeset"}, as presented below.

<<>>=
probesetSummaries <- rma(affyExonFS, target="probeset")
@

For Exon arrays, Affymetrix provides additional annotation files that define meta-probesets (MPSs), used to summarize the data to the gene level. These MPSs are classified into three groups -- core, extended and full -- depending on the level of confidence of the sources used to generate such annotations. Additional values allowed for the \Rfunarg{target} argument are \Robject{"core"}, \Robject{"extended"} and \Robject{"full"}. The example below shows how gene-level summaries can be obtained through \Rpackage{oligo}.
<<>>=
geneSummaries <- rma(affyExonFS, target="core")
@

The results obtained from analyses performed with \Rpackage{oligo} can be easily combined with features offered by other packages. As an example, we use the \Rpackage{biomaRt} package to obtain the IDs of probesets on the Human Exon array that map to Entrez Gene ID~10948 (ENSG00000131748).

<<>>=
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
theIDs <- getBM(attributes="affy_huex_1_0_st_v2", filters="entrezgene",
                values=10948, mart=ensembl)
names(theIDs) <- 'psets'
@

Combining this information with the annotation package associated with the data in \Robject{affyExonFS}, we can get detailed facts on the probesets found to map to Entrez Gene ID~10948. Below, we obtain, respectively, the MPS IDs, probeset IDs, probe IDs and start/stop positions for the probesets identified above.

<<>>=
library(AnnotationDbi)
conn <- db(affyExonFS)
fields <- 'meta_fsetid, pmfeature.fsetid, fid, start, stop'
tables <- 'featureSet, pmfeature, core_mps'
sql <- paste("SELECT", fields, "FROM", tables,
             "WHERE pmfeature.fsetid=featureSet.fsetid",
             "AND featureSet.fsetid=core_mps.fsetid",
             "AND pmfeature.fsetid=:psets")
probesetInfo <- dbGetPreparedQuery(conn, sql, theIDs)
@

The availability of the start and stop positions of the probesets improves the visualization of the summaries at the exon level. If genomic coordinates were available for the probes themselves, the visualization could be improved even more. To achieve this, we first obtain the sequences for the probes identified above. We saw that the \Rmethod{pmSequence} method provides the sequences for all PM probes on the chip; here, instead, we directly load the \Rpackage{Biostrings}-based object used to store the sequence information for these probes. This gives us access not only to the sequences, but also to the probe IDs linked to them.

<<>>=
library(Biostrings)
data(pmSequence, package=annotation(affyExonFS))
@

Because probe IDs are available in the \Robject{pmSequence} object, we can easily restrict our search to the probes listed in the \Robject{probesetInfo} object.

<<>>=
idx <- match(probesetInfo[["fid"]], pmSequence[["fid"]])
pmSequence <- pmSequence[idx,]
@

<<>>=
rm(idx)
@

The \Robject{pmSequence} object behaves like a \Rclass{data.frame}, but it is comprised of complex data structures defined in \Rpackage{Biostrings}. Below, we modify its representation to make it a regular \Rclass{data.frame} object.

<<>>=
pmSequence <- data.frame(fid=pmSequence[["fid"]],
                         sequence=as.character(pmSequence[["sequence"]]),
                         stringsAsFactors=FALSE)
@

By joining the \Robject{probesetInfo} and \Robject{pmSequence} objects, we centralize the available probe annotation.

<<>>=
probeInfo <- merge(probesetInfo, pmSequence)
@

<<>>=
rm(probesetInfo, pmSequence)
@

The genomic coordinates in \Robject{probeInfo} refer to the probesets. To better visualize the observed probe intensities, we would be better off if the coordinates were relative to the probes. Below, we use the \Rpackage{BSgenome.Hsapiens.UCSC.hg18} package to obtain up-to-date genomic coordinates. The coordinates are found by aligning the probe sequences to the reference genome made available through the package. Because Entrez Gene ID~10948 is located on chromosome~17, the search is limited to this region.
<>= library("BSgenome.Hsapiens.UCSC.hg19") chr17 <- Hsapiens[["chr17"]] seqs <- complement(DNAStringSet(probeInfo[["sequence"]])) seqs <- PDict(seqs) matches <- matchPDict(seqs, chr17) @ <>= rm(seqs, chr17) @ After matching the sequences, we update the genomic coordinates. <<>>= probeInfo[["start"]] <- unlist(startIndex(matches)) probeInfo[["stop"]] <- unlist(endIndex(matches)) @ <>= rm(matches) @ With the updated coordinates, we reorder the probe information object, \Robject{probeInfo}, and extract the probe intensities in the same order. The probe ID field, \Robject{fid} in \Robject{probeInfo}, provides direct access to the probes of interest. The \Rmethod{exprs} method is used to access the intensity matrix of the \Robject{affyExonFS} object and immediately subsetted to the probes of interest. After subsetting the observed intensities, we $\log_2$-transform the data. <>= probeInfo <- probeInfo[order(probeInfo[["start"]]),] probeData <- exprs(affyExonFS)[probeInfo[["fid"]],] probeData <- log2(probeData) @ We use the updated genomic to estimate the probeset coverage. This information will be used when plotting the data and will provide approximate delimiters of the probesets. <>= attach(probeInfo) probesetStart <- aggregate(as.data.frame(start), list(fsetid=fsetid), min) names(probesetStart) <- c("fsetid", "start") probesetStop <- aggregate(as.data.frame(stop), list(fsetid=fsetid), max) names(probesetStop) <- c("fsetid", "stop") detach(probeInfo) @ The \Robject{psInfo} object will store the probeset information (probeset ID, start and stop positions), as shown below. After ordering appropriately the data, the \Robject{psInfo} probeset is attached, to simplify its usage during the \R{} session. <>= psInfo <- merge(probesetStart, probesetStop) psInfo <- psInfo[order(psInfo[["start"]]),] psInfo[["fsetid"]] <- as.character(psInfo[["fsetid"]]) attach(psInfo) probesetData <- exprs(probesetSummaries[fsetid,]) detach(psInfo) @ <>= rm(probesetStart, probesetStop) @ To visualize the data processed by \Rpackage{oligo}, we will use the \Rpackage{GenomeGraphs} package. To match the genome build used to update the probe coordinates, an archived version of the database will be queried. <>= library(GenomeGraphs) probeids <- as.character(probeInfo[["fsetid"]]) ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl") geneid <- "ENSG00000131748" title <- makeTitle(text=geneid, color="darkred") @ The raw data, in the $\log_2$ scale, will be represented by the \Robject{raw} object below, created with the \Rmethod{makeExonArray} constructor. <>= attach(probeInfo) raw <- makeExonArray(intensity=probeData, probeStart=start, probeEnd=stop, probeId=probeids, nProbes=rep(1, nrow(probeInfo)), dp=DisplayPars(color="blue", mapColor="dodgerblue2"), displayProbesets=FALSE) detach(probeInfo) @ The summarized data is also represented through an object created by \Rmethod{makeExonArray}. The structure is identical to the one used above. <>= attach(psInfo) exon <- makeExonArray(intensity=probesetData, probeStart=start, probeEnd=stop, probeId=fsetid, nProbes=rep(1, nrow(psInfo)), dp=DisplayPars(color="seagreen", mapColor="seagreen"), displayProbesets=FALSE) @ To represent the probesets designed by Affymetrix, we use an \Rclass{AnnotationTrack} object. <>= affyModel <- makeAnnotationTrack(start = start, end = stop, feature = "gene_model", group = geneid, dp = DisplayPars(gene_model="darkgreen")) detach(psInfo) @ The gene and transcripts representations are build as follows. 
Affymetrix probes will be represented in green, while the gene will be in orange; transcripts are represented in blue.

<<>>=
gene <- makeGene(id=geneid, biomart=ensembl)
transcript <- makeTranscript(id=geneid, biomart=ensembl)
legend <- makeLegend(c("Affymetrix", "Gene"), fill=c("darkgreen", "orange"))
@

Figure~\ref{fig:ENSG131748}, generated with the \Rfunction{gdPlot} function, shows the representation of the $\log_2$-intensities and summaries at the exon level. It also shows the probesets, gene and transcripts on the region of interest.

<<>>=
gdPlot(list(title, raw, exon, affyModel, gene, transcript, legend))
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-ENSG131748}
  \caption{Visual representation of observed $\log_2$-intensities and summarized data at the exon level for gene ENSG00000131748. The probes, gene and transcript are also represented, respectively, in green, orange and blue.}
  \label{fig:ENSG131748}
\end{figure}

Below, we identify the meta-probeset ID associated with the probes used above. Once that is known, we can extract the proper gene-level summaries stored in \Robject{geneSummaries}.

<<>>=
mps <- unique(probeInfo[["meta_fsetid"]])
mps <- as.character(mps)
mps
@

Therefore, the standard accessors can be used to obtain the gene summaries for the unit above. Figure~\ref{fig:scatterENSG131748} shows the expression estimates for gene ENSG00000131748 across the 33 samples available in this dataset.

<<>>=
gSummaries <- exprs(geneSummaries[mps,])
x <- 1:length(gSummaries)
plot(x, gSummaries, xlab="Sample", ylab="Expression", main=geneid)
@

\begin{figure}[!htp]
  \centering
  \includegraphics[width=.45\textwidth]{fig-geneLevelSummaries}
  \caption{Expression levels estimated through RMA at the gene level.}
  \label{fig:scatterENSG131748}
\end{figure}

\section{Interfacing with \Rpackage{ACME} to Find Enriched Regions Using Tiling Arrays}
\label{sec:tiling}

In this section, we demonstrate how \Rpackage{oligo} can be easily combined with tools that rely on the structure implemented in the \Rpackage{Biobase} package. Using a sample ChIP-chip dataset kindly provided by NimbleGen, we could use the \Rfunction{getNgsColorsInfo} function to obtain the information regarding channels and sample names for the XYS files saved on disk. The \Rfunction{getNgsColorsInfo} function parses the file names and, using the \Robject{\_532} and \Robject{\_635} strings in the names, suggests channels and sample names for each XYS file available.

<<eval=FALSE>>=
library(oligo)
info <- getNgsColorsInfo("tilingData", full=TRUE)
@

Combining the results in \Robject{info} with \Rfunction{read.xysfiles2}, we read the XYS files using a data structure that simplifies the data management across different channels.

<<eval=FALSE>>=
nimbleTilingFS <- read.xysfiles2(info[,2], info[,1], sampleNames=info[,3])
@

However, in this example, we will load the aforementioned dataset from the \Rpackage{oligoData} package, as described below:

<<>>=
library(oligoData)
data(nimbleTilingFS)
nimbleTilingFS
@

The user can access the channel-specific data by calling the \Rmethod{channel} method. The resulting object is an \Rclass{ExpressionSet} object that the user can use as required.

<<>>=
c1 <- channel(nimbleTilingFS, "channel1")
c2 <- channel(nimbleTilingFS, "channel2")
@

Detailed information on the PM probes available on the array can be obtained by directly querying the annotation package.
The call below will extract the \Robject{fid}, \Robject{fsetid}, \Robject{chromosome} and \Robject{start} position of each probe from the annotation package and order the results by chromosome and start position.

<<>>=
fields <- 'fid, fsetid, chrom as chromosome, position as start'
sql <- paste("SELECT", fields,
             "FROM pmfeature INNER JOIN featureSet USING(fsetid)",
             "ORDER BY chrom, position")
annotPM <- dbGetQuery(db(nimbleTilingFS), sql)
@

<<>>=
rm(sql, c1, c2)
@

Using the probe sequences, the end position of each probe can be easily obtained. We load the sequences directly, so the \Robject{fid} field can be used to order the sequences appropriately.

<<>>=
data(pmSequence, package=annotation(nimbleTilingFS))
idx <- match(annotPM[["fid"]], pmSequence[["fid"]])
pmSequence <- pmSequence[idx,]
@

To obtain the end position, we use \Rmethod{width}, defined in the \Rpackage{Biostrings} package.

<<>>=
attach(annotPM)
library(Biostrings)
annotPM[["end"]] <- start+width(pmSequence[["sequence"]])-1
head(annotPM)
@

The \Robject{fid} field corresponds to the row number in the \Robject{nimbleTilingFS} object. When applied to the raw data object, the \Rmethod{getM} function returns a matrix with the $\log_2$-ratios of the intensities. Below, we get the $\log_2$-ratios corresponding to the PM probes described in the \Robject{annotPM} object.

<<>>=
ratioPM <- getM(nimbleTilingFS)[fid,]
dimnames(ratioPM) <- NULL
detach(annotPM)
class(ratioPM)
@

By converting \Robject{annotPM} to an \Rclass{AnnotatedDataFrame}, it can be used in the \Rclass{featureData} slot of \Rclass{eSet}-like objects.

<<>>=
annotPM <- as(annotPM, "AnnotatedDataFrame")
@

We will use the \Rpackage{ACME} package to calculate enrichment, using algorithms that are insensitive to normalization strategies and array noise. To work with this package, we first need to create an \Rclass{ACMESet} object, which contains the \Robject{chromosome}, \Robject{start} and \Robject{end} positions in the \Robject{featureData} slot.

<<>>=
library(ACME)
acme <- new("ACMESet", exprs=ratioPM, featureData=annotPM)
@

The \Rfunction{do.aGFF.calc} function processes the \Rclass{ACMESet} object, using a window size and a threshold to identify the positive probes in the object.

<<>>=
calc <- do.aGFF.calc(acme, window=1000, thresh=0.95)
@

The \Robject{calc} object is then used to find enriched regions with the \Rfunction{findRegions} function, as shown below.

<<>>=
regs <- findRegions(calc)
head(regs)
@

\section{High Performance Computing Features}

Starting with the 1.12.x series, the \Rpackage{oligo} package offers high performance computing features:
\begin{itemize}
\item \textbf{Support for larger datasets; and}
\item \textbf{Support for parallel computing.}
\end{itemize}

These features are initially available for the RMA methods on Expression/Gene/Exon arrays and will be implemented in other methods as the need arises. The use of such features is as simple as loading the required packages (and registering a parallel backend, if parallel computing is desired). The methods themselves are able to detect whether these experimental features are enabled and use them if possible, without any modification of the method call.

\subsection{Support for large datasets}

The \oligo{} package uses the features implemented by the \Rpackage{ff} package to provide better support for large datasets. If the user prefers not to use the \Rpackage{ff} package, then regular R objects are used and the usual memory restrictions apply. The support for large datasets is enabled by simply loading the \Rpackage{ff} package.
Once that is done, \Rpackage{oligo} saves ff files to the directory pointed to by \Rcode{ldPath()}.

<<>>=
library(oligo)
library(ff)
ldPath()
@

Methods like \Rmethod{rma} use batches to process the data. When possible (e.g., background correction), at most \Rcode{ocSamples()} samples are processed simultaneously. For procedures that process probes (like summarization), at most \Rcode{ocProbesets()} probesets are used simultaneously. Therefore, the user should tune these parameters for better performance.

<<>>=
ocSamples()
ocSamples(50)    ## changing default to 50
ocProbesets()
ocProbesets(100) ## changing default to 100
@

<<eval=FALSE>>=
library(oligo)
library(ff)
rawData <- read.celfiles(list.celfiles())
rmaRes <- rma(rawData)
exprs(rmaRes)[1:10,]
@

\subsection{Parallel computing}

The \oligo{} package can make use of a parallel environment (currently with \Rcode{rma}), set up via the \Rpackage{foreach} package, as long as the user:
\begin{itemize}
\item enables support for large datasets (loads \Rpackage{ff});
\item loads the \Rpackage{foreach} package;
\item registers a parallel backend (for example, through one of the \Rpackage{doMPI}, \Rpackage{doMC} or \Rpackage{doSNOW} packages).
\end{itemize}

A simple example is shown below:

<<eval=FALSE>>=
library(ff)
library(foreach)
library(doMC)
registerDoMC(2)
library(oligo)
@

<<eval=FALSE>>=
rawData <- read.celfiles(list.celfiles())
rmaRes <- rma(rawData)
rmaRes
@

\subsection{Parallel Computing on Multicore Machines}

On multicore machines, one alternative for parallel preprocessing is shown below. It assumes that the machine has enough RAM to deal with the dataset and that the \Rpackage{ff} package is \textbf{NOT} loaded. The snippet compares the performance of a single-threaded run of \Rmethod{rma} with that of a run using 4 threads (enabled by setting the \Rcode{R\_THREADS} environment variable); \Rmethod{fitProbeLevelModel} would also benefit from this setting.

<<>>=
library(oligoData)
data(affyExonFS)
t0 <- system.time(res0 <- rma(affyExonFS))
Sys.setenv(R_THREADS=4)
t1 <- system.time(res1 <- rma(affyExonFS))
all.equal(res0, res1)
t0
t1
@

\section{Session Info}

<<>>=
toLatex(sessionInfo())
@

\bibliographystyle{plainnat}
\bibliography{FrameworkPreprocessing}

\end{document}