\documentclass{article} \usepackage{amsmath} \usepackage{Sweave} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} \SweaveOpts{keep.source=TRUE, fig=FALSE} %\VignetteIndexEntry{Splines, plots, and interactions} % Ross Ihaka suggestions \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{prefix.string=splines,width=6,height=4} \setkeys{Gin}{width=\textwidth} <>= options(continue=" ", width=60) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1)))) pdf.options(pointsize=8) #text in graph about the same as regular text options(contrasts=c("contr.treatment", "contr.poly")) #reset default @ \title{Spline terms in a Cox model} \author{Terry Therneau} \begin{document} \maketitle This is a pair of topics that comes up just often enough in my work that I end up re-discovering how to do it correctly about once a year. A note showing how may be useful to others, it is certainly a useful reference for me. \section{Plotting smooth terms} Here is a simple example using the MGUS data. (And turning off the garish color defaults of termplot). <>= require(survival) mfit <- coxph(Surv(futime, death) ~ sex + pspline(age), data=mgus) termplot(mfit, term=2, se=TRUE, col.term=1, col.se=1) @ Two questions of the plot are ``how was it centered'' and whether we can easily plot it on the hazard as opposed to the log hazard scale. The solution to both is to use the plot=FALSE option of termplot, which returns the data points that would be plotted back to the user. <>= ptemp <- termplot(mfit, se=TRUE, plot=FALSE) attributes(ptemp) ptemp$age[1:4,] @ The termplot function depends on a call to predict with type='terms', which returns a centered set of predictions. Like a simple linear model fit, the intercept is a separate term, which is found in the ``constant'' attribute above, and each column of the result is centered at zero. Since any given $x$ value may appear multiple times in the data and thus in the result of predict, and the termplot function removes duplicates, the plot may not be exactly centered at zero however. Now suppose we want to redraw this on log scale with age 50 as the reference, i.e., the risk is 1 for a 50 year old. Since the Cox model is a relative hazards model we can choose whatever center we like. (If there were no one of exactly age 50 in the data set the first line below would need to do an interpolation, e.g. by using the approx function.) <>= center <- with(ptemp$age, y[x==50]) ytemp <- ptemp$age$y + outer(ptemp$age$se, c(0, -1.96, 1.96), '*') matplot(ptemp$age$x, exp(ytemp - center), log='y', type='l', lty=c(1,2,2), col=1, xlab="Age at diagnosis", ylab="Relative death rate") @ Voila! We now have a plot that is more interpretable. The approach is appropriate for any term, not just psplines. The above plot uses log scale for the y axis which is appropriate for the question of whether a non-linear age effect was even necessary for this data set (it is not), one could remove the log argument to emphasize the Gomperzian effect of age on mortality. \section{Splines in an interaction} As an example we will use the effect of age on survival in the \texttt{flchain} data set, a population based sample of subjects from Olmsted County, Minnesota. If we look at a simple model using age and sex we see that both are very significant. <>= options(show.signif.stars=FALSE) # display intelligence fit1 <- coxph(Surv(futime, death) ~ sex + pspline(age, 3), data=flchain) fit1 termplot(fit1, term=2, se=TRUE, col.term=1, col.se=1, ylab="log hazard") @ We used a smoothing spline because the printout then nicely segregates the linear and non-linear effects. The non-linearity is not very large, as compared to the linear portion, but still may be important. We would like to go forward and fit separate age curves for the males and the females, since the above fit makes an unwarranted assumption that the male/female ratio of death rates will be the same at all ages. The primary problem is that a formula of \texttt{sex * pspline(age)} does not work; the coxph routine is not clever enough to do the right thing automatically. (Perhaps some future version will be sufficiently bright, but don't hold your breath). If we were using regression splines instead, e.g. \texttt{ns(age, df=4)}, the coxph routine would succeed but the plotting would fail; the solution below works for both cases. We need to create our own dummy variables to handle the interaction. <>= agem <- with(flchain, ifelse(sex=="M", age, 60)) agef <- with(flchain, ifelse(sex=="F", age, 60)) fit2 <- coxph(Surv(futime, death) ~ sex + pspline(agef, df=3) + pspline(agem, df=3), data=flchain) anova(fit2, fit1) @ The gain in this particular problem is not great, but we will forge ahead. You might well ask why we used 60 as a dummy value of \texttt{agem} for the females instead of 0? There is nothing special about the choice, and any value within the range of ages would do as well, though I try to pick one where the standard errors of the curves are not outrageous. If a value of 0 is used it forces the pspline function to create a basis set that includes all the empty space between 0 and 50, and do predictions at 0; these last can become numerically unstable leading to errors or incorrect values. The Cox model deals with relative hazards, when doing a plot we will usually want to specify who our reference is. By default the termplot function uses an average reference, that is, any plot will be centered to have an average log hazard of 0. In this case, we decided to use 65 year old females as our reference, with all of the hazards relative to them. <>= # predictions pterm <- termplot(fit2, term=2:3, se=TRUE, plot=FALSE) # reference refdata <- data.frame(sex=c('F', 'M'), agef=c(65, 60), agem=c(60,65)) pred.ref <- predict(fit2, newdata=refdata, type="lp") # females tempf <- pterm$agef$y + outer(pterm$agef$se, c(0, -1.96, 1.96)) frow <- which(pterm$agef$x == 65) tempf <- tempf - tempf[frow,1] # shift curves # males tempm <- pterm$agem$y + outer(pterm$agem$se, c(0, -1.96, 1.96)) mrow <- which(pterm$agem$x == 65) tempm <- tempm + diff(pred.ref) - tempm[mrow,1] # plot matplot(pterm$agef$x, exp(tempf), log='y', col=1, lty=c(1,2,2), type='l', lwd=c(2,1,1), xlab="Age", ylab="Relative risk of death") matlines(pterm$agem$x, exp(tempm), log='y', col=2, lwd=c(2,1,1), lty=c(1,2,2)) legend(80, 1, c("Female", "Male"), lty=1, lwd=2, col=1:2, bty='n') @ \begin{enumerate} \item The termplot routine is used to get the data points for the plot, without executing a plot, by use of the \texttt{plot=FALSE} argument. The result is a list with one element per term; each element of the list contains x, y, and se components. \item We had decided to center the female curve at age 65, risk =1. The relative offset for the male curve can be derived directly from \texttt{fit2} by adding up the right coefficients, and I used to do it that way but would get it wrong one time out of two. So instead use the \texttt{predict} routine to get predicted log hazards for males and females at a particular age. This tells me how far apart the curves should be at that point. We force the females to go through 0, which is exp(0) =1 on the hazard scale. \item Get the predicted curve and confidence bands for the females as a matrix \texttt{tempf}, and then shift them by subtracting the value for a 65 year old female. Do the same for males, plus adding in the curve separation at age 65 from \texttt{pred.ref}. \item The male and female portions don't have quite the same set of age values, there are no 95 year old males in the data set for example, so the plot needs to be done in two steps. \end{enumerate} The final curves for males and female are not quite parallel. One thing the plot does not display is that the spacing between the male and female points also has a standard error. This moves the entire bundle of three red curves up and down. It is not clear how best to add this information into the plot. For questions of parallelism and shape, as here, it seemed best to ignore it, which is what the termplot function also does. If someone were reading individual male/female differences off the plot a different choice would be appropriate. \end{document}