%\VignetteIndexEntry{Infrastructure for copy number analysis} %\VignetteDepends{crlmm, genomewidesnp6Crlmm} %\VignetteKeywords{crlmm, SNP 6} %\VignettePackage{crlmm} \documentclass{article} \usepackage{graphicx} \usepackage{natbib} \usepackage{url} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\oligo}{\Rpackage{oligo }} \newcommand{\R}{\textsf{R}} \newcommand{\crlmm}{\Rpackage{crlmm}} \newcommand{\ff}{\Rpackage{ff}} \usepackage[margin=1in]{geometry} \begin{document} \title{Infrastructure for copy number analysis in \crlmm{}} \date{\today} \author{Rob Scharpf} \maketitle \begin{abstract} This vignette provides an overview of the \Rclass{CNSet} class and a brief discussion of the underlying infrastructure for large data support with the \Rpackage{ff} package. This vignette instantiates an object of class \Rclass{CNSet} using a trivial dataset with 3 files. As this sample size is too small for estimating copy number with the \crlmm{} package, the final section of this vignette loads an object created by the analysis of 180 HapMap CEL files (Affymetrix 6.0 platform). This object was instantiated by running the \verb+AffyGW+ vignette. \end{abstract} <>= library(ff) library(crlmm) @ \section{Supported platforms} The supported Affymetrix and Infinium platforms are those for which a corresponding annotation package is available. The annotation packages contain information on the markers, such as physical position and chromosome, as well as pre-computed parameters estimated from HapMap used during the preprocessing and genotyping steps. For Affymetrix, the 5.0 and 6.0 platforms are supported and the corresponding annotation packages are \Rpackage{genomewidesnp5Crlmm} and \Rpackage{genomewidesnp6Crlmm}. Supported Infinium platforms are listed in the following code chunk. <>= pkgs <- annotationPackages() crlmm.pkgs <- pkgs[grep("Crlmm", pkgs)] crlmm.pkgs[grep("human", crlmm.pkgs)] @ \textbf{Large data:} In order to reduce \crlmm{}'s memory footprint for copy number estimation, we require the \Rpackage{ff}. The \Rpackage{ff} package provides infrastructure for accessing and writing data to disk instead of keeping data in memory. As the functions for preprocessing, genotyping, and copy number estimation do not simultaneously require all samples and all probes in memory, memory-usage by \crlmm{} can be fine-tuned by reading in and processing subsets of the markers and/or samples. The functions \Rfunction{ocSamples} and \Rfunction{ocProbesets} in the \Rpackage{oligoClasses} package can be used to declare how many markers and samples to read at once. In general, specifying smaller values should reduce the RAM required for a particular job. In general, smaller values will increase the run-time. In the following code-chunk, we declare that \crlmm{} should process 100,000 markers at a time (when possible) and 200 samples at a time. If our dataset contained fewer than 200 samples, the \Rfunction{ocSamples} option would not have any effect. One can view the current settings for these commands, by typing the functions without an argument. <>= ocProbesets(100e3) ocSamples(200) @ \section{The \Rclass{CNSet} container} \subsection{Instantiating an object of class \Rclass{CNSet}} An object of class \Rclass{CNSet} can be instantiated by one of two methods: \begin{itemize} \item[Approach 1:] during the preprocessing of the raw intensities for Illumina and Affymetrix arrays by the the functions \Rfunction{constructInf} and \Rfunction{genotype}, respectively. (The \Rfunction{genotype} calls the function \Rfunction{constructAffy} to initialize a \Rclass{CNSet} object for Affymetrix platforms.) \item[Approach 2:] by subsetting an existing \Robject{CNSet} object. As per usual, the `[' method can be used to extract a subset of markers $i$ as in `[i, ]', a subset of samples $j$ as in `[, j]', or a subset of markers $i$ and samples $j$ as in `[i, j]'. \end{itemize} There are important differences in the underlying data representation depending on how the \Rclass{CNSet} object was instantiated. In particular, objects generated by the functions \Rfunction{constructInf} and \Rfunction{genotype} store high-dimensional data on disk rather than in memory through protocols defined in the \R{} package \ff{}. For instance, the normalized intensities and genotype calls in a \Rclass{CNSet}-instance from approach (1) are \ff{}-derived objects. By contrast, when such an objected generated by approach (1) is subset by the `[' method, an object of the same class is returned but the \Rpackage{ff}-derived objects are coerced to ordinary matrices. Note, therefore, that both approaches (1) and (2) may involve substantial I/O and that (2) should be performed judiciously. \subsubsection{Approach 1} To illustrate the first approach, we begin by specifying a local directory to store output files and setting the \Rfunction{ldPath} function with this path. <>= outdir <- paste("/local_data/r00/crlmm/", getRversion(), "/infrastructure", sep="") ldPath(outdir) if(!file.exists(outdir)) dir.create(outdir) @ Next we load the annotation package required, as well as the \R{} package \Rpackage{hapmapsnp6} that contains 3 example CEL files. We use the \Rfunction{system.file} function to find the path to the CEL files. <>= require(genomewidesnp6Crlmm) & require(hapmapsnp6) path <- system.file("celFiles", package="hapmapsnp6") celfiles <- list.celfiles(path, full.names=TRUE) @ Typically, an object of class \Rclass{CNSet} is instantiated as part of the preprocessing and genotyping by calling the function \Rfunction{genotype}, as illustrated in the \verb+AffyGW+ vignette. <>= exampleSet <- genotype(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") @ Several files with \verb+.ff+ extensions now appear in the directory indicated by the \Rfunction{ldPath} function. <>= ldPath() @ One could also instantiate an object of class \Rclass{CNSet} without preprocessing/genotyping by calling the exported function \Rfunction{constructAffy} directly using the \Rfunction{:::} operator. <>= tmp <- constructAffy(celfiles, batch=rep("1", 3), cdfName="genomewidesnp6") @ The \Rfunction{show} method provides a concise summary of the \Robject{exampleSet} object. Note the class of the elements in the \verb+batchStatistics+ and \verb+assayData+ slots is indicated in the first line of the summary. <>= invisible(open(exampleSet)) exampleSet @ % We briefly outline some of the unique aspects of the % \Rclass{CNSet}-class using in \crlmm{} that may differ from the more % standard extensions of the virtual class \Rclass{eSet} defined in % the \Rpackage{Biobase} package. As the \verb+assayData+ elements of the \Robject{exampleSet} object are stored on disk rather than in memory, we inspect attributes of the elements by first opening the file connection using the \Rfunction{open}. For example, in the following code we extract the normalized intensities for the A allele by first opening the object returned by the \Rfunction{A} function. The name of the file where the data is stored on disk is provided by the \Rfunction{filename}. Finally, the normalized intensities can be pulled from disk to memory by the `[' method. It can be useful to wrap the `[' method by the \Rfunction{as.matrix} to ensure that the output is the desired class. <>= invisible(open(exampleSet)) class(A(exampleSet)) filename(A(exampleSet)) as.matrix(A(exampleSet)[1:5, ]) @ \paragraph{Moving \texttt{*.ff} files} Ideally, the files with \verb+.ff+ extension should not be moved. However, if this is not possible, the safest way to move these files is to clone all of the \ff{} objects using the \Rfunction{clone}, followed by the \Rfunction{delete} function to remove the original files on disk. An example of the \Rfunction{delete} function is included at the end of the \verb+IlluminaPreprocessCN+ vignette. See the documentation for the \Rfunction{clone} and \Rfunction{delete} functions in the \ff{} package for additional details. \paragraph{Order of operations:} For \Rclass{CNSet}-instances derived by approach (1), users should be mindful of the substantial I/O when using accessors to extract data from the class. For example, the following 2 methods would extract identical results, with the latter being much more efficient (extra parentheses are added to the second operation to emphasize the order of operations): <>= ##inefficient ##invisible(open(cnSet)) A(exampleSet[1:5, ]) ## preferred (A(exampleSet))[1:5, ] @ \subsubsection{Approach 2: using `['} Here we instantiate a new object of class \Rclass{CNSet} by applying the `[' method to an existing object of class \Rclass{CNSet}. <>= cnset.subset <- exampleSet[1:5, ] @ Note the class of the \verb+batchStatistics+ and \verb+assayData+ elements of the \Robject{cnset.subset} object printed in the first line of the summary. <>= show(cnset.subset) @ \subsection{Slots of class \Rclass{CNSet}} \subsubsection{\texttt{featureData}} Information on physical position, chromosome, and whether the marker is a SNP can be accessed through accessors defined for the \verb+featureData+. <>= library(Biobase) fvarLabels(exampleSet) position(exampleSet)[1:10] chromosome(exampleSet)[1:10] is.snp <- isSnp(exampleSet) table(is.snp) snp.index <- which(is.snp) np.index <- which(!is.snp) chr1.index <- which(chromosome(exampleSet) == 1) @ \subsubsection{\texttt{assayData}} The \verb+assayData+ elements are of the class \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how the \Rclass{CNSet} object was instantiated. Elements in the \verb+assayData+ environment can be listed using the \Rfunction{ls} function. <>= ls(assayData(exampleSet)) @ The normalized intensities for the A and B alleles have names \verb+alleleA+ and \verb+alleleB+ and can be accessed by the methods \Rfunction{A} and \Rfunction{B}, respectively. Genotype calls and confidence scores can be accessed by \Rfunction{snpCall} and \Rfunction{snpCallProbability}, respectively. Note that the confidence scores are represented as integers to reduce the filesize, but can be tranlated to the probability scale using the \R{} function \Rfunction{i2p}. For example, <>= scores <- as.matrix(snpCallProbability(exampleSet)[1:5, 1:2]) i2p(scores) @ Note that for the Affymetrix 6.0 platform the assay data elements each have a row dimension corresponding to the total number of polymorphic and nonpolymorphic markers interrogated by the Affymetrix 6.0 platform. A consequence of keeping the rows of the assay data elements the same for all of the statistical summaries is that the matrix used to store genotype calls is larger than necessary. \subsubsection{\texttt{batch} and \texttt{batchStatistics}} As defined in Leek \textit{et al.} 2010, \textit{Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study}. The \verb+batchStatistics+ slot is an environment used to store SNP- and batch-specific summaries, such as the sufficient statistics for the genotype clusters and the linear model parameters used for copy number estimation. The \verb+batch+ slot is used to store the 'batch name' for each array. For small studies in which the samples were processed at similar times (e.g., within a month), all the samples can be considered to be in the same batch. For large studies in which the samples were processed over several months, users should the scan date of the array or the chemistry plate are useful surrogates. The only constraint on the \verb+batch+ variable is that it must be a character vector that is the same length as the number of samples to be processed. The \verb+batch+ is specified as an argument to the \R{} functions \Rfunction{constructInf} and \Rfunction{genotype} that instantiate \Rclass{CNSet} objects for the Illumina and Affymetrix platforms, respectively. The \Rfunction{batch} function can be used to access the \verb+batch+ information on the samples as in the following example. <>= batch(exampleSet) @ For the \verb+batchStatistics+ slot, the elements in the environment have the class \Rclass{ff\_matrix}/\Rclass{ffdf} or \Rclass{matrix}, depending on how the \Rclass{CNSet} object was instantiated. The dimension of each element is the number of markers (SNPs + nonpolymorphic markers) $\times$ the number of batches. The names of the elements in the environment can be list using the \R{} function \Rfunction{ls}. <>= ls(batchStatistics(exampleSet)) @ Currently, the batch-specific summaries are stored to allow some flexibility in the choice of downstream analyses of copy number and visual assessments of model fit. Documentation for such applications will be expanded in future versions of \crlmm{}, and are currently not intended to be accessed directly by the user. %The \Robject{CNSet} class also contains the slot %\Robject{batchStatistics} that contains batch-specific summaries %needed for copy number estimation. In particular, each element is a %matrix (or an ff object) with R rows and C columns, correspoinding to %R markers and C batches. The summaries includes the within genotype %cluster medians and median absolute deviations (mads), but also %parameters estimated from the linear model. (For unobserved %genotypes, the medians are imputed and the variance is obtained the %median variance (across markers) within a batch. ) The elements of the %slot can be listed as follows. \subsubsection{\texttt{phenoData}} Sample-level summaries obtained during the preprocessing/genotyping steps include skew (SKW), the signal to noise ratio (SNR), and gender (1=male, 2=female) are stored in the \verb+phenoData+ slot. As for other \Rclass{eSet} extensions, the \verb+$+ method can be used to extract these summaries. For \Rclass{CNSet} objects generated by approach (1), these elements are of the class \Rclass{ff\_vector}. <>= varLabels(exampleSet) class(exampleSet$gender) invisible(open(exampleSet$gender)) exampleSet$gender @ The `[' methods without arguments can be used to coerce to a vector. <>= c("male", "female")[exampleSet$gender[]] invisible(close(exampleSet$gender)) @ \subsubsection{\texttt{protocolData}} The scan date of the arrays are stored in the \verb+protocolData+. <>= varLabels(protocolData(exampleSet)) protocolData(exampleSet)$ScanDate @ \section{Trouble shooting with a HapMap example} This section uses an object of class \Rclass{CNSet} instantiated by the \verb+AffyGW+ vignette and saved to a local path on our computing cluster indicated by the object \Robject{outdir} below. The \verb+copynumber+ vignette was used to fill out the \verb+batchStatistics+ slot of the \Robject{cnSet} object. <>= if(getRversion() < "2.13.0"){ rpath <- getRversion() } else rpath <- "trunk" outdir <- paste("/thumper/ctsa/snpmicroarray/rs/ProcessedData/crlmm/", rpath, "/copynumber_vignette", sep="") @ Next, we load the \Robject{cnSet} object. <>= if(!exists("cnSet")) load(file.path(outdir, "cnSet.rda")) invisible(open(cnSet)) @ \subsection{Missing values} Most often, missing values occur when the genotype confidence scores for a SNP were below the threshold used by the \Robject{crlmmCopynumber} function. For the HapMap analysis, we used a confidence threshold of 0.80 (the default). In the following code, we assess NA's appearing for the raw copy number estimates for the first 10 samples. <>= GT.CONF.THR <- 0.80 autosome.index <- which(isSnp(cnSet) & chromosome(cnSet) < 23) sample.index <- 1:10 ct <- totalCopynumber(cnSet, i=autosome.index, j=sample.index) ca <- CA(cnSet, i=autosome.index, j=sample.index) cb <- CB(cnSet, i=autosome.index, j=sample.index) missing.ca <- is.na(ca) missing.cb <- is.na(cb) (nmissing.ca <- sum(missing.ca)) (nmissing.cb <- sum(missing.cb)) identical(nmissing.ca, nmissing.cb) @ If \Robject{nmissing.ca} is nonzero, check the genotype confidence scores provided by the crlmm genotyping algorithm against the threshold specified in \Robject{crlmmCopynumber}. <>= if(nmissing.ca > 0){ ##invisible(open(snpCallProbability(cnSet))) gt.conf <- i2p(snpCallProbability(cnSet)[autosome.index, sample.index]) ##invisible(close(snpCallProbability(cnSet))) below.thr <- gt.conf < GT.CONF.THR index.allbelow <- as.integer(which(rowSums(below.thr) == length(sample.index))) nmissingBecauseOfGtThr <- length(index.allbelow) * length(sample.index) stopifnot(identical(nmissingBecauseOfGtThr, nmissing.ca)) ## or calculate the proportion of missing effected by low crlmm confidence length(index.allbelow) * length(sample.index)/nmissing.ca } @ One could inspect the cluster plots for the 'low confidence' calls. <>= ## TODO @ We repeat the above check for missing values at polymorphic loci on chromosome X. In this case, we compare the \Robject{rowSums} of the missing values to the number of samples to check whether all of the estimates are missing for a given SNP. <>= ## start with first batch sample.index <- which(batch(cnSet)==batch(cnSet)[1]) X.index <- which(isSnp(cnSet) & chromosome(cnSet) == 23) ca.X <- CA(cnSet, i=X.index, j=sample.index) missing.caX <- is.na(ca.X) (nmissing.caX <- sum(missing.caX)) missing.snp.index <- which(rowSums(missing.caX) == length(sample.index)) index <- which(rowSums(missing.caX) == length(sample.index)) ##length(index)*length(sample.index)/nmissing.caX @ From the above codechunk, we see that \Sexpr{length(index)} SNPs have NAs for all the samples. Next, we tally the number of NAs for polymorphic markers on chromosome X that are below the confidence threshold. For the HapMap analysis, all of the missing values arose from SNPs in which either the men or the women had confidence scores that were all below the threshold. <>= invisible(open(cnSet$gender)) F <- which(cnSet$gender[sample.index] == 2) M <- which(cnSet$gender[sample.index] == 1) gt.conf <- i2p(snpCallProbability(cnSet)[X.index, sample.index]) below.thr <- gt.conf < GT.CONF.THR index.allbelowF <- as.integer(which(rowSums(below.thr[, F]) == length(F))) index.allbelowM <- as.integer(which(rowSums(below.thr[, M]) == length(M))) all.equal(index.allbelowF, index.allbelowM) all.equal(as.integer(index.allbelowF), as.integer(index)) @ For nonpolymorphic loci, the genotype confidence scores are irrelevant and estimates are available at most markers. <>= np.index <- which(!isSnp(cnSet) & chromosome(cnSet)==23) ca.F <- CA(cnSet, i=np.index, j=F) ca.M <- CA(cnSet, i=np.index, j=M) ## NAs for one marker ca.F <- ca.F[-match("CN_974939", rownames(ca.F)), ] ca.M <- ca.M[-match("CN_974939", rownames(ca.M)), ] sum(is.na(ca.F)) sum(is.na(ca.M)) @ %TODO: marker CN\_974939 has NAs for the normalized intensities. This %is because CN\_974939 is not in the \Robject{npProbesFid} file in %\Rpackage{genomewidesnp6Crlmm}. The \Robject{npProbesFid} file should %be updated in the next \Rpackage{genomewidesnp6Crlmm} release. In total, there were \Sexpr{length(missing.snp.index)} polymorphic markers on chromosome X for which copy number estimates are not available. Lowering the confidence threshold would permit estimation of copy number at most of these loci. A confidence threshold is included as a parameter for the copy number estimation as an approach to reduce the sensitivity of genotype-specific summary statistics, such as the within-genotype median, to intensities from samples that do not clearly fall into one of the biallelic genotype clusters. There are drawbacks to this approach, including variance estimates that can be a bit optimistic at some loci. More direct approaches for outlier detection and removal may be explored in the future. Copy number estimates for other chromosomes, such as mitochondrial and chromosome Y, are not currently available in \crlmm{}. <>= invisible(close(cnSet)) invisible(close(cnSet$gender)) @ \section{Session information} <>= toLatex(sessionInfo()) @ %\section*{References} % %%\begin{bibliography} % \bibliographystyle{plain} % \bibliography{refs} %\end{bibliography} \end{document}