%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}
\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{BioC 2008 practical: Machine learning with genome-scale data}
\author{\copyright 2008 VJ Carey stvjc at channing.harvard.edu}
\maketitle
\tableofcontents

\section{Introduction}

The term \textit{machine learning} refers to a family of computational
methods for analyzing multivariate datasets.  Each data point has a
vector of \textit{features} in a shared \textit{feature space}, and may
have a \textit{class label} from some fixed finite set.

\textit{Supervised learning} refers to processes that help articulate
rules that map \textit{feature vectors} to \textit{class labels}.  The
class labels are known and function as supervisory information to guide
rule construction.

\textit{Unsupervised learning} refers to processes that discover
structure in collections of feature vectors.  Typically the structure
consists of a grouping of objects into clusters.

Some basic points to consider at the start:
\bi
\item Distinguish predictive modeling from inference on model
parameters.  Typical work in epidemiology focuses on estimation of
relative risks, and random samples are not required.  Typical work with
machine learning tools targets estimation (and minimization) of the
misclassification rate, and representative samples are required for
this task.
\item All prediction or clustering algorithms, like all modeling
procedures, rely on a choice of distance metric that permits
quantitative evaluation of similarity and dissimilarity among objects.
The choice can affect results, and there is typically no \textit{a
priori} basis for selecting the distance function.  A
one-minus-correlation distance can be very different from Euclidean
distance for a given pair of genes (a small numerical sketch follows
this list).
\item ``Two cultures'': model fitters vs.\ algorithmic predictors.  If
statistical models are correct, parameter estimation based on the mass
of data can yield optimal discriminators (e.g., LDA).  Algorithmic
discriminators tend to identify boundary cases and downweight the mass
of data (e.g., boosting, SVM).
\item Different learning tools have different capabilities.  There is
little \textit{a priori} guidance on matching learning algorithms to
aspects of problems.  While it is convenient to sift through a variety
of approaches, one must pay a price for the model search.
\item Data and model/learner visualization are important, but
visualization in higher-dimensional data structures is hard.  Dynamic
graphics can help; look at GGobi and the rggobi package for this.
\item These notes provide very little mathematical background on the
methods; see for example Ripley (\textit{Pattern Recognition and Neural
Networks}, 1995), Duda, Hart and Stork (\textit{Pattern
Classification}), and Hastie, Tibshirani and Friedman (2003,
\textit{Elements of Statistical Learning}) for extensive background.
\ei
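To make the point about distance choice concrete, the following small
sketch (illustrative values only; the chunk is shown but not evaluated)
compares Euclidean distance with one-minus-correlation distance for two
artificial expression profiles:
<<eval=FALSE>>=
# two artificial expression profiles across ten samples;
# g2 is a shifted, rescaled copy of g1, so the profiles are perfectly correlated
g1 = c(1, 2, 3, 4, 5, 4, 3, 2, 1, 2)
g2 = 2 * g1 + 10
# Euclidean distance is large ...
sqrt(sum((g1 - g2)^2))
# ... while one-minus-correlation distance is essentially zero
1 - cor(g1, g2)
@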
\section{Data structures}

The representation of genome-scale data affects many aspects of data
analysis.  For microarray measures of mRNA abundance (expression
arrays) we use the \texttt{ExpressionSet} class to unify data and
metadata on a set of arrays.  Let $G$ denote the number of genes probed
on the array, and $N$ the number of samples, which will be assumed
independent (familial data structures are not directly considered).
The key points for an \texttt{ExpressionSet} instance \texttt{X} are
\begin{itemize}
\item \texttt{exprs(X)} is a $G \times N$ \texttt{matrix} of expression
values (typically on the log scale)
\item \texttt{pData(X)} is an $N \times R$ \texttt{data.frame} of
sample-level variables
\item \verb+X$v+ is an $N$-vector of values on the sample-level
variable named \texttt{v}
\item \texttt{X[G, S]} is a new \texttt{ExpressionSet} instance with
genes restricted according to predicate \texttt{G} and samples
restricted according to predicate \texttt{S}
\end{itemize}

Our first example is the Chiaretti et al.\ dataset on acute lymphocytic
leukemia.
<<>>=
if (!("package:ALL" %in% search())) library(ALL)
if (!exists("ALL")) data(ALL)
@
<<>>=
library(ALL)
data(ALL)
ALL
@
We will focus on the molecular classification of leukemia type within
B-cell leukemias, and create a subset of B-cell ALL samples that are
either positive or negative for BCR/ABL gene fusion.
<<>>=
table(ALL$BT, ALL$mol.biol)
bALL = ALL[, substr(ALL$BT, 1, 1) == "B"]
fbALL = bALL[, bALL$mol.biol %in% c("BCR/ABL", "NEG")]
fbALL$mol.biol = factor(fbALL$mol.biol, levels=c("NEG", "BCR/ABL"))
fbALL$binFus = 1*(fbALL$mol.biol == "BCR/ABL")
@
The last assignment creates a 0--1 representation of the
\texttt{mol.biol} factor.

\section{Logistic regression and linear discriminant analysis}

A standard analysis of a problem with a two-class outcome focuses on
modeling the probability of class membership.  With our representation,
we consider the probability that a sample is positive for BCR/ABL
fusion (this event is denoted $F=1$, $F$ a binary random variable)
conditional on the level of expression of a selected gene.  A linear
logistic model takes the form
\[
\mbox{logit}\,\mbox{Pr}(F = 1|x) = \alpha + x\beta
\]

\subsection{Manual fit of a logistic regression}

This can be fit manually using R as follows.  We use the third gene on
the array as $x$.
<<>>=
lr1 = glm(formula = fbALL$binFus ~ exprs(fbALL)[3, ], family = binomial)
summary(lr1)
@

\subsubsection*{Exercises}
\begin{itemize}
\item Visualize the class-specific distributions of the modeled gene.
\item What is the name of the gene used in this analysis?
\end{itemize}

\subsection{Using MLInterfaces}

\subsubsection{A simple application: single-gene logistic regression}

Here we use a generic method that operates on \texttt{ExpressionSet}
instances and formulae to carry out the same logistic regression
analysis, but with cross-validation.  In this form of cross-validation,
the dataset is partitioned into five subsets, each of which is formed
to have approximately equal representation of the outcome classes.
<<>>=
library(MLInterfaces)
lr2 = MLearn(mol.biol~., fbALL[3,], glmI.logistic(thresh=.5),
    xvalSpec("LOG", 5, balKfold.xvspec(5)), family=binomial)
lr2
@
Notice that the result of this call is not a table of coefficients, but
an object.  That object has the same formal structure for any
successful call to \texttt{MLearn}.
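Before dissecting that object, here is a hedged, un-evaluated sketch of
what thresholded prediction means for the manually fitted model
\texttt{lr1} of the previous section.  Because it reuses the training
data, this resubstitution confusion matrix will generally be optimistic
relative to the cross-validated one produced by \texttt{MLearn}.
<<eval=FALSE>>=
phat = predict(lr1, type="response")   # fitted Pr(F = 1 | x) for each sample
predClass = ifelse(phat > 0.5, "BCR/ABL", "NEG")
table(observed = fbALL$mol.biol, predicted = predClass)
@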
We can get the table of coefficients through the following:
<<>>=
summary(RObject(RObject(lr2)[[1]]$mlans))
@
The access path to the coefficients is somewhat convoluted -- two
levels of storage must be traversed.  At the top level (the interior
call to \texttt{RObject}, applied to \texttt{lr2}), we are looking into
the result of five cross-validation iterations.  We pick the first
using \texttt{[[1]]}.  This returns a list.  The element named
\texttt{mlans} holds the table of coefficients (actually the fitted glm
model), retrieved using the outer call to \texttt{RObject} and rendered
using \texttt{summary}.

The \texttt{MLearn} method is tailored to predictive applications.  For
logistic modeling, a threshold is specified in the third argument to
\texttt{MLearn}.  The predicted probability of fusion for each sample
is computed according to the fitted model, and if it exceeds the
threshold parameter, the sample is predicted to be in the fusion class;
otherwise it is predicted to be negative for fusion.  The confusion
matrix for the cross-validated prediction exercise cross-tabulates
known class vs.\ predicted class for all samples.  The proportion of
off-diagonal entries is an estimate of the misclassification rate.

\subsubsection*{Exercises}
\begin{itemize}
\item If we relax the threshold for classifying to fusion to 40\%, what
happens to the misclassification rate for cross-validated, single-gene,
logistic regression-based prediction?
\end{itemize}
%lr3 = MLearn(mol.biol~., fbALL[3,], glmI.logistic(thresh=.40),
%    xvalSpec("LOG", 5, balKfold.xvspec(5)),
%    family=binomial)

\subsubsection{Prediction with many genes}

We will now illustrate linear discriminant analysis (LDA).  We use a
collection of genes, denoted $x_i$ for sample $i$, and compute the
multivariate mean for each class (denoted $\mu_F$ and $\mu_N$ for
fusion and negative respectively) and the common covariance matrix
$\Sigma$ for all the observations.  If $\pi_F$ and $\pi_N$ are the
proportions of fusion and negative samples in the dataset, the
classification procedure is to allocate the sample with gene
`signature' $x_i$ to class $F$ when
\[
LD(x_i) = (\mu_F - \mu_N)^t \Sigma^{-1}(x_i - .5(\mu_F+\mu_N)) >
\log(\pi_N/\pi_F).
\]
For this to be feasible on modest hardware, we need to filter the genes
in use.  We use genefilter's \texttt{nsFilter} procedure to eliminate
genes with relatively low variability across samples.
<<>>=
library(genefilter)
ffbALL = nsFilter(fbALL, var.func=var, var.cutoff=.9)
ffbALL[[2]]          # check exclusion events
ffbALL = ffbALL[[1]] # keep only the ExpressionSet
@
Now we construct a cross-validated linear discriminant analysis using
all filtered genes.
<<>>=
lda1 = MLearn(mol.biol~., ffbALL, ldaI,
    xvalSpec("LOG", 5, balKfold.xvspec(5)))
@
<<>>=
mm = confuMat(lda1)
mcr = (sum(mm)-sum(diag(mm)))/sum(mm)
mcr
@
Is this a legitimate application?  Is \Sexpr{round(mcr,3)} a good
estimate of the misclassification rate of the procedure?  Perhaps not,
because feature selection was conducted outside the cross-validation
procedure.  Ideally the filtering would be redone within each
cross-validation iteration, so that the set of genes filtered away
could differ from iteration to iteration.  This process can be factored
into our cross-validation using the \texttt{fsFun} parameter of
\texttt{xvalSpec}.  Ideally we would apply this feature selection
process to the full \texttt{fbALL} ExpressionSet, but on a small
computer this seems to cause problems.
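Before doing so, it may help to see in outline what per-fold selection
involves.  The following un-evaluated sketch (using a hypothetical
\texttt{trainSamples} index vector; this is not the actual
\texttt{fs.topVariance} code) recomputes a top-variance filter on a
training subset only:
<<eval=FALSE>>=
# keep genes whose variance, computed in the training fold only,
# exceeds the given quantile of the per-gene variances
topVarFeatures = function(trainExprs, fraction = 0.9) {
    v = apply(trainExprs, 1, var)
    which(v > quantile(v, fraction))
}
# e.g., for a training-sample index vector trainSamples:
# keep = topVarFeatures(exprs(fbALL)[, trainSamples])
@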
We illustrate with only 2500 genes:
<<>>=
lda2 = MLearn(mol.biol~., fbALL[1:2500,], ldaI,
    xvalSpec("LOG", 5, balKfold.xvspec(5), fsFun=fs.topVariance(.9)))
@
<<>>=
lda2
confuMat(lda2)
length(fsHistory(lda2)[[1]])
@

\subsection{Summary}
\bi
\item ExpressionSets store assay and sample-level data from microarray
experiments;
\item Manual application of standard statistical modeling tools to test
specific gene effects is feasible, but variation in calling sequences
and return values reduces efficiency of application and interpretation;
\item The \texttt{MLearn} function of \textit{MLInterfaces} supports
direct application of supervised learning methods to ExpressionSet
instances
\bi
\item the standard formula interface may refer to any phenoData
variable for the response
\item learnerSchema instances select the algorithm to be used; we
looked at \texttt{glmI.logistic} and \texttt{ldaI}
\item cross-validation is supported through the \texttt{xvalSpec}
object
\item algorithmic feature selection can be embedded in cross-validation
\ei
\item MLearn returns a structured object responding to
\texttt{confuMat}, \texttt{RObject}, and, when relevant,
\texttt{fsHistory}
\ei

\section{Some technical details of MLInterfaces}

\subsection{The signature of MLearn}
<<>>=
showMethods("MLearn")
@

\subsection{Available learnerSchema instances}
<<>>=
grep(".*I($|\\.)", ls("package:MLInterfaces"), value=TRUE)
@
You can add your own schemata for new learning functions.  See the
vignette MLint\_devel.

\subsection{Tuning}

You can set additional parameters in the \texttt{...} argument of
\texttt{MLearn}.  Eventually a \texttt{tuningSpec} object will be
defined to control this.

\section{Supervised learning: Additional illustrations}

\subsection{CART}

Decision trees are attractive models for certain investigations.  If
the process under study has a hierarchical structure, so that some
features decompose the population at a high level, and others operate
within lower-level components, a tree-structured model may be useful.

Classification And Regression Trees (CART) denotes a family of
algorithms that aggressively sift through features in a recursive
series of splits of the data.  At the first stage, all the data live in
a root tree node.  All features are dichotomized in all possible ways,
and the node is split using the feature that leads to two nodes that
are most pure in the distribution of the class label, according to some
user-selected metric such as the Gini index or the deviance.  The
process recurses in the new nodes.  Tree construction proceeds until
nodes reach some minimal size, and the tree may then be pruned back.
Details can be found in Ripley, \textit{Pattern Recognition and Neural
Networks}, 1995.
<<>>=
rp1 = MLearn(mol.biol~., ffbALL, rpartI,
    xvalSpec("LOG", 5, balKfold.xvspec(5)))
@
<<>>=
confuMat(rp1)
@
Each fit yields an extensive summary.
<<>>=
summary(RObject(RObject(rp1)[[1]]$mlans))
@
We have cross-validated and can inspect the tree for each iteration:
<<fig=TRUE>>=
tr1 = RObject(RObject(rp1)[[1]]$mlans)
plot(tr1, uniform=TRUE, branch=.2, compress=TRUE, margin=.1,
    main=paste("xval tree", 1))
text(tr1, all=TRUE, use.n=TRUE, fancy=TRUE, pretty=TRUE)
@
<<fig=TRUE>>=
tr2 = RObject(RObject(rp1)[[2]]$mlans)
plot(tr2, uniform=TRUE, branch=.2, compress=TRUE, margin=.1,
    main=paste("xval tree", 2))
text(tr2, all=TRUE, use.n=TRUE, fancy=TRUE, pretty=TRUE)
@

\subsubsection*{Exercises}
\bi
\item The trees visualized here are not as informative as they would be
if gene symbols were used for probe sets.  Alter ffbALL so that the
featureNames are symbols and generate a better plot.
You could use code like
<<>>=
X = featureNames(ffbALL)
library(hgu95av2.db)
SX = mget(X, hgu95av2SYMBOL)
any(duplicated(unlist(SX)))   # symbols must be unique to serve as featureNames
featureNames(ffbALL) = unlist(SX)
rrp1 = MLearn(mol.biol~., ffbALL, rpartI,
    xvalSpec("LOG", 5, balKfold.xvspec(5)))
@
\item Create a deeper set of trees by specifying an option defined in
rpart.control.  For example, set minsplit=3.  Use \texttt{plotcp} on
the resulting tree objects and interpret.
\ei

\subsection{Random forests and variable importance assessment}

Leo Breiman extended the tree-structured modeling approach by
integrating random feature and case selection over a long sequence of
tree fits.  Voting over the tree sequence is used to create the
classifier.  According to Wikipedia, ``Each tree is constructed using
the following algorithm:
\bi
\item Let the number of training cases be N, and the number of
variables in the classifier be M.
\item We are told the number m of input variables to be used to
determine the decision at a node of the tree; m should be much less
than M.
\item Choose a training set for this tree by choosing N times with
replacement from all N available training cases (i.e.\ take a bootstrap
sample).  Use the rest of the cases to estimate the error of the tree,
by predicting their classes.
\item For each node of the tree, randomly choose m variables on which
to base the decision at that node.  Calculate the best split based on
these m variables in the training set.
\item Each tree is fully grown and not pruned (as may be done in
constructing a normal tree classifier).''
\ei
This is easy to use with MLearn.  Because of the internal resampling,
we do not need to cross-validate (unless we really want to).
<<>>=
set.seed(12345)
@
<<>>=
rf1 = MLearn(mol.biol~., ffbALL, randomForestI, xvalSpec("NOTEST"),
    importance=TRUE)
@
<<>>=
confuMat(rf1, "train")
@
The variable importance measure can be rendered as follows:
<<fig=TRUE>>=
par(las=2, mar=c(5,9,5,5))
plot(getVarImp(rf1, TRUE), plat="hgu95av2", toktype="SYMBOL")
@

\subsubsection*{Exercises}
\bi
\item Generate a textual report on the relative importance measures,
using \texttt{getVarImp}.
\item Compare the top genes identified via randomForest to those
identified via limma.  Comment on the added learning provided by the
machine learning algorithm.
\ei

\subsection{Regularized discriminant analysis}

This is an interesting algorithm in which LDA is generalized in two
directions.  First, the covariance matrix of the data is modeled as a
linear combination of an identity matrix and the sample covariance
matrix.  Second, features are dropped according to their distance from
sample centroids.  See the 2007 \textit{Biostatistics} paper of Guo,
Hastie and Tibshirani for details.

The code distributed by Guo et al.\ has internal cross-validation by
which the parameters $\alpha \in [0,1]$ (the weight on the sample
covariance matrix, to which $(1-\alpha)I$ is added to get the effective
covariance used in the discriminant computation) and $\delta$ (the
parameter dictating how features are dropped depending on their
distance from data centroids) are selected.  We therefore use RDA in
two stages, first to inspect the results of internal cross-validation,
and then to compute the predictors for the `optimal' $\alpha$ and
$\delta$.  (There is typically a range of attractive values for these
parameters.)
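To see why shrinking toward the identity matters when there are many
more genes than samples, the following self-contained, un-evaluated
sketch (simulated data, using the $\alpha \hat\Sigma + (1-\alpha)I$
convention described above, not the rda package internals) shows that
the regularized covariance is invertible even though the sample
covariance is singular:
<<eval=FALSE>>=
set.seed(1)
p = 50; n = 20                      # more features than samples
Z = matrix(rnorm(n * p), nrow = n)
S = cov(Z)                          # rank at most n - 1, so solve(S) fails
alpha = 0.11
Sreg = alpha * S + (1 - alpha) * diag(p)
dim(solve(Sreg))                    # the shrunken matrix can be inverted
@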
<<>>=
set.seed(12345)
@
<<>>=
rda1 = MLearn(mol.biol~., ffbALL, rdacvI, xvalSpec("NOTEST"))
@
We now retrieve the summary of cross-validation results:
<<>>=
attr(RObject(rda1), "xvalAns")
@
We see that values of $\alpha$ around .11 and $\delta$ around .33 lead
to small numbers of errors in cross-validation.  Thus:
<<>>=
rda2 = MLearn(mol.biol~., ffbALL, rdaI, xvalSpec("NOTEST"),
    alpha=.11, delta=.333)
confuMat(rda2, "train")
@

\subsubsection*{Exercises}
\bi
\item There is a problem with the rendering of the confusion matrix
above.  Describe how to avoid it.
\item Obtain the list of `retained genes' by inspecting the rda2
object.  Compare to the limma top table.
\ei

\subsection{Support vector machine}

An important variation on LDA is the support vector machine (SVM)
algorithm.  The basic ideas can be gleaned from a paper by Kristin
Bennett to be distributed at the course.
<<>>=
svm1 = MLearn(mol.biol~., ffbALL, svmI,
    xvalSpec("LOG", 5, balKfold.xvspec(5)), kernel="linear")
confuMat(svm1)
@

\subsubsection*{Exercises}
\bi
\item Examine the \texttt{tune} infrastructure of package e1071 and
consider whether a selection of parameters for tuning the svm can
improve performance in this case.  To use tune you will have to extract
response and predictor data and convert them to appropriate formats.
\ei

\subsection{Boosting}

Random forests use an ensemble of trees to generate predictions;
boosting uses very simple trees generated along a sequence of data
reweighting steps.  At each iteration, data that are easy to classify
are downweighted.  We use the implementation in the \textit{ada}
package; there are others.  This seems to be a computationally
intensive algorithm, so we run it only briefly for feasibility.  You
may examine the effects of reducing the feature set.
<<>>=
ada1 = MLearn(mol.biol~., ffbALL, adaI, 1:40, type="discrete", iter=20)
confuMat(ada1)
@

\section{Unsupervised learning in brief}

Unsupervised learning occurs in the absence of class labels.  In
essence, one is trying to learn both the class labels and the rules for
attaching them to objects yet unseen.  Cluster analysis is widely used,
is propagated through heatmap displays, and has substantial conceptual
defects.  We will not discuss it unless there is extra time.

\subsection{PCA and biplots}

To discuss principal components and dimension reduction, we use a
simpler dataset: the crabs data in the MASS package.  Consider the
following display:
<<fig=TRUE>>=
library(MASS)
data(crabs)
pairs(crabs[,-c(1:3)], col=ifelse(crabs$sp=="B", "blue", "orange"))
@
Various crab body measurements are plotted against each other.  Clearly
there are high correlations between certain variables, and it would be
useful to focus on measures that are independently informative on the
relationship between crab size and species.  Such independently
informative measures can be constructed using linear combinations of
the raw measurements.

\subsubsection{PCA defined and illustrated}

Principal components analysis transforms the multivariate data $X$ into
a new coordinate system.  If the original variables are $X_1, \ldots,
X_p$, then the variables in the new representation are denoted PC1,
\ldots, PCp.  These new variables have the properties that PC1 is the
linear combination of the $X_j$ having maximal variance, PC2 is the
variance-maximizing linear combination of the residuals of $X$ after
projection onto PC1, and so on.  If most of the variation in $X_{n
\times p}$ can be captured in a low-dimensional linear subspace of the
space spanned by the columns of $X$, scatterplots of the first few
principal components will depict this.
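The relationship between principal components and the singular value
decomposition, formalized next, can be checked directly in R.  A short
un-evaluated sketch using the crabs measurements (assuming MASS is
available, as above):
<<eval=FALSE>>=
library(MASS)
data(crabs)
X = scale(as.matrix(crabs[, -c(1:3)]), center=TRUE, scale=FALSE)
sv = svd(X)                       # X = U D V^t for the centered matrix
pcs = prcomp(crabs[, -c(1:3)])    # prcomp also centers by default
# the columns of XV (= UD) agree with prcomp's scores, up to sign
max(abs(abs(X %*% sv$v) - abs(pcs$x)))
# and the singular values equal sqrt(n-1) times the PC standard deviations
sv$d / (sqrt(nrow(X) - 1) * pcs$sdev)
@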
Formally, we can compute the PCs using the singular value decomposition
of the (column-centered) $X$, in which $X = UDV^t$, where $U_{n \times
p}$ and $V_{p \times p}$ have orthonormal columns, and $D$ is a
diagonal matrix of $p$ nonnegative singular values.  The principal
components transformation is $XV = UD$, and if $D$ is structured so
that $D_{ii} \geq D_{jj}$ whenever $i \leq j$, then column $i$ of $XV$
is PC$i$.  Note also that $D_{ii} = \sqrt{n-1}\,\mbox{SD}(\mbox{PC}i)$.
<<>>=
library(MASS)
data(crabs)
pcs = prcomp( crabs[,-c(1:3)] )
@
<<fig=TRUE>>=
plot(pcs)
@
<<fig=TRUE>>=
pairs(pcs$x, col=ifelse(crabs$sp=="B", "blue", "orange"))
@

\subsubsection{Biplots: superimposing samples and variables after
dimension reduction}

The biplot shows the data in PC space and also shows the relative
contributions of the original variables in composing the
transformation.
<<fig=TRUE>>=
biplot(pcs, choices=2:3)
@
In addition to the sample representation in the chosen pair of
components (here PC2 and PC3), we have a representation of the original
variables and their roles in defining the principal components
transformation.  The right-hand and top axes give the scale for the
components of the eigenvectors (loadings) corresponding to the original
variables.

A formal definition of the biplot procedure is given in Venables and
Ripley, and is worth reviewing.  Rows of $X$ are observations and
columns are variables.  A rank-2 approximation to $X$ is obtained via
the singular value decomposition, setting all but the largest two
singular values to zero.  Now
\[
X \approx [u_1 u_2] \left[ \begin{array}{cc} \lambda_1 & 0 \\
0 & \lambda_2 \end{array} \right]
\left[ \begin{array}{c} v_1^t \\ v_2^t \end{array} \right] = GH^t
\]
and various approaches can be entertained for absorption of the
singular values $\lambda_i$ into $G$ and $H$.  A two-parameter system
for doing this is
\[
G = n^{\alpha/2} [u_1 u_2] \left[ \begin{array}{cc} \lambda_1 & 0 \\
0 & \lambda_2 \end{array} \right]^{1-\theta}~,~~~
H = n^{-\alpha/2} [v_1 v_2] \left[ \begin{array}{cc} \lambda_1 & 0 \\
0 & \lambda_2 \end{array} \right]^{\theta}.
\]
The ``principal component biplot'' sets $\alpha = \theta = 1$, and
consists of plotting rows of $G$ and $H$ with distinguished symbols.
Euclidean distances between rows of $G$ represent Mahalanobis distances
between observations; inner products between rows of $H$ represent
covariances between variables.

\subsubsection*{Exercise}

Create a biplot using the gene expression data from ALL, focusing on
genes found to be predictive by randomForest.  Interpret.

\end{document}