%\documentclass[article]{jss} \documentclass[nojss,article]{jss} % ----- for the package-vignette, don't use JSS logo, etc % % \author{Martin Maechler\\ Seminar f\"ur Statistik \\ ETH Zurich, \ Switzerland % \\\email{maechler@stat.math.ethz.ch}} \author{Martin M\"achler \\ ETH Zurich} \title{Dip Test Distributions, P-values, and other Explorations} % \def\mythanks{a version of this paper, for \pkg{nacopula} 0.4\_4, has been published % in JSS, \url{http://www.jstatsoft.org/v39/i09}.} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Martin M\"achler} %% comma-separated \Plaintitle{Dip Test Distributions, P-values, and other Explorations} % \Shorttitle{} %\date{April 2009 ({\tiny typeset on \tiny\today})} %%\VignetteIndexEntry{Dip Test Distributions, P-values, and other Explorations} %%\VignetteDepends{diptest} \SweaveOpts{engine=R,keep.source=TRUE,strip.white=true} % ^^^^^^^^^^^^^^^^ \SweaveOpts{eps=FALSE,pdf=TRUE,width=7,height=4} %% an abstract and keywords \Abstract{ ... % FIXME ... % FIXME } % \Keywords{MPFR, Abitrary Precision, Multiple Precision Floating-Point, R} %% at least one keyword must be supplied %% publication information %% NOTE: Typically, this can be left commented and will be filled out by the technical editor %% \Volume{13} %% \Issue{9} %% \Month{September} %% \Year{2004} %% \Submitdate{2004-09-29} %% \Acceptdate{2004-09-29} %% The address of (at least) one author should be given %% in the following format: \Address{ Martin M\"achler\\ Seminar f\"ur Statistik, HG G~16\\ ETH Zurich\\ 8092 Zurich, Switzerland\\ E-mail: \email{maechler@stat.math.ethz.ch}\\ URL: \url{http://stat.ethz.ch/people/maechler} } %% It is also possible to add a telephone and fax number %% before the e-mail in the following format: %% Telephone: +43/1/31336-5053 %% Fax: +43/1/31336-734 %% for those who use Sweave please include the following line (with % symbols): %% MM: this is "substituted" by jss.cls: %% need no \usepackage{Sweave.sty} % \usepackage{myVignette} % \usepackage{fullpage}% save trees ;-) --- FIXME use {geometry} package % \usepackage[authoryear,round,longnamesfirst]{natbib} % \bibliographystyle{plainnat} % %% Marius' packages \usepackage[american]{babel}%for American English % \usepackage{microtype}%for character protrusion and font expansion (only with pdflatex) \usepackage{amsmath}%sophisticated mathematical formulas with amstex (includes \text{}) \usepackage{mathtools}%fix amsmath deficiencies \usepackage{amssymb}%sophisticated mathematical symbols with amstex (includes \mathbb{}) % \usepackage{amsthm}%theorem environments % \usepackage{bm}%for bold math symbols: \bm (= bold math) % %NON-STANDARD:\RequirePackage{bbm}%only for indicator functions % \usepackage{enumitem}%for automatic numbering of new enumerate environments % \usepackage[ % format=hang, % % NOT for JSS: labelsep=space, % justification=justified, % singlelinecheck=false%, % % NOT for JSS: labelfont=bf % ]{caption}%for captions % \usepackage{tikz}%sophisticated graphics package % \usepackage{tabularx}%for special table environment (tabularx-table) % \usepackage{booktabs}%for table layout % This is already in jss above -- but withOUT the fontsize=\small part !! \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontsize=\small,fontshape=sl} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\small} \DefineVerbatimEnvironment{Scode}{Verbatim}{fontsize=\small,fontshape=sl} % but when submitting, do get rid of too much vertical space between R % input & output, i.e. between Sinput and Soutput: \fvset{listparameters={\setlength{\topsep}{0pt}}}% !! quite an effect! %% % \newcommand*{\R}{\proglang{R}}%{\textsf{R}} \newcommand*{\Arg}[1]{\texttt{\itshape $\langle$#1$\rangle$}} \newcommand*{\file}[1]{{`\normalfont\texttt{#1}'}} \newcommand*{\eps}{\varepsilon} % %% Probability P[.], Expectation E[.] etc \makeatletter %% == subsection of our flexible-style "texab.sty" : \newcommand{\@pkl}{[} % Probability Klammer links \newcommand{\@pkr}{]} \newcommand{\@ekl}{[} % Erwartungswert Klammer links \newcommand{\@ekr}{]} % Erwartungswert Klammer rechts \DeclareMathOperator{\PRSymbol}{P} % Next line (\makeright): if #1 == \left then \right #2 else #1 #2 \newcommand{\makeright}[2]{\ifx#1\left\right#2\else#1#2\fi} %% the real commands \newcommand{\PR}[2][\left] {\PRSymbol #1\@pkl #2 \makeright{#1}{\@pkr}} \newcommand{\ERW}[2][\left] {\ERWSymbol #1\@ekl #2 \makeright{#1}{\@ekr}} \makeatother \newcommand{\isD}{\ {\stackrel{\mathcal{D}}{=}}\ \ } \newcommand*{\iid}{\mbox{ i.i.d. }} % \begin{document} \setkeys{Gin}{width=\textwidth} % Manuel has \setlength{\abovecaptionskip}{-5pt} % %% include your article here, just as usual %% Note that you should use the \pkg{}, \proglang{} and \code{} commands. % \section[About Java]{About \proglang{Java}} %% Note: If there is markup in \(sub)section, then it has to be escape as above. %% Note: These are explained in '?RweaveLatex' : \begin{footnotesize} <>= options(SweaveHooks= list(fig=function() par(mar=c(5.1, 4.1, 1.1, 2.1))), width = 75, digits = 7, # <-- here, keep R's default! prompt = "R> ", # <- "yuck!" - required by JSS continue=" ") set.seed(47) Sys.setenv(LANGUAGE = "en") if(.Platform$OS.type != "windows") Sys.setlocale("LC_MESSAGES","C") ## In order to save() and load() expensive results thisDir <- system.file('doc', package='diptest') ## xtraDir <- if(Sys.getenv("USER") == "maechler") ## "~/R/Pkgs/diptest/stuff" else thisDir ## res1.file <- file.path(thisDir, "aggr_results.Rdata") <>= if(nzchar(Sys.getenv("R_MM_PKG_CHECKING"))) print( .libPaths() ) @ \end{footnotesize} % \maketitle % \begin{abstract} % \end{abstract} \section[Introduction]{Introduction}% \small~\footnote{\mythanks}} \label{sec:Intro} %% MM FIXME: Need notation $D_n :=$\texttt{dip( runif(n) )}; but more generally, \begin{equation} \label{eq:Dn.F} D_n(F) := D(X_1, X_2, \dots, X_n), \mbox{ \ \ \texttt{where} } X_i \iid, X_i \sim F. \end{equation} \citet{HarJH85} in their ``seminal'' paper on the dip statistic $D_n$ already proved that $ \sqrt{n} \; D_n$ converges in distribution, i.e., \begin{equation} \label{eq:D.infty} \lim_{n\to\infty}\sqrt{n} \; D_n \isD D_\infty. \end{equation} A considerable part of this paper is devoted to explore the distribution of $D_\infty$. \bigskip \section[History of the diptest package]{History of the \texttt{diptest} \textsf{R} package} \citet{HarP85} published an implementation in Fortran of a concrete algorithm, % ALGORITHM AS 217 APPL. STATIST. (1985) VOL.34, NO.3 where the code was also made available on Statlib\footnote{Statlib is now a website, of course, \url{http://lib.stat.cmu.edu/}, but then was \emph{the} preferred way for distributing algorithms for statistical computing, available years before the existence of the WWW, and entailing e-mail and (anonymous) FTP} On July 28, 1994, Dario Ringach, then at NY University, asked on Snews (the mailing list for S and S-plus users) about distributions and was helped by me and then about \texttt{dyn.load} problems, again helped by me. Subsequently he provided me with S-plus code which interfaced to (a \texttt{f2c}ed version of) Hartigan's Fortran code, for computing the dip statistic. and ended the (then private) e-mail with \begin{quotation} I am not going to have time to set this up for submission to StatLib. If you want to do it, please go ahead. Regards, Dario \end{quotation} - several important bug fixes; last one Oct./Nov.~2003 However, the Fortran code file \url{http://lib.stat.cmu.edu/apstat/217}, was last changed {Thu 04 Aug 2005 03:43:28 PM CEST}. We have some results of the dip.dist of \emph{before} the bug fix; notably the ``dip of the dip'' probabilities have changed considerably!! - see rcs log of ../../src/dip.c \section{21st Century Improvement of Hartigan$^2$'s Table} (( Use listing package (or so to more or less ``cut \& paste'' the nice code in \texttt{../../stuff/new-simul.Rout-1e6} )) \section{The Dip in the Dip's Distribution} \label{sec:dip_dip} We have found empirically that the dip distribution itself starts with a ``dip''. Specifically, the minimal possible value of $D_n$ is $\frac{1}{2n}$ \emph{and} the probability of reaching that value, \begin{equation} \label{eq:P.Dn_min} \PR{D_n = \frac{1}{2n}}, \end{equation} is large for small $n$. E.g., consider an approximation of the dip distribution for $n=5$, <>= require("diptest") # after installing it .. D5 <- replicate(10000, dip(runif(5))) hist(D5, breaks=128, main = "Histogram of replicate(10'000, dip(runif(5))))") @ which looks as if there was a bug in the software --- but that look is misleading! Note how the phenomenon is still visible for $n=8$, <>= D8 <- replicate(10000, dip(runif(8))) hist(D8, breaks=128, main = "Histogram of replicate(10'000, dip(runif(8))))") @ Note that there is another phenomenon, in addition to the point mass at $1/(2n)$, particularly visible, if we use \emph{many} replicates, <>= set.seed(11) n <- 11 B.s11 <- 500000 D11 <- replicate(B.s11, dip(runif(n))) <<2nd-small-sample-phenomen--n-eq-11, echo=false>>= if(file.exists(ff <- file.path(thisDir, "hist-D11.rda"))) { load(ff) } else { ## takes a few minutes <> hD11 <- hist(D11, breaks=1e-6+(63:298)/(2*11*64), plot=FALSE) save(hD11, n, B.s11, file= ff) } <<2nd-small-sample-phenomen--n-eq-11, echo=false, fig=true>>= B.str <- format(B.s11, sci=FALSE, big.mark="'") plot(hD11, main = "", ## main = sprintf("Histogram of replicate(%s, dip(runif(%d)))", B.str, n), border=NA, col="dark gray", xlab = substitute("Dip" ~~ D[.N.](U(group("[",list(0,1),"]"))), list(.N. = n))) title(xlab= substitute(B == .B.SIM. ~ "replicates", list(.B.SIM. = B.str)), adj = .88) lcol <- adjustcolor("orange4", 0.4) abline(v = (1:3)/(2*n), col=lcol, lty=3, lwd=2) axis(1, pos=0, at = (1:3)/(2*n), labels = expression(1/22, 2/22, 3/22), col=lcol, col.axis=lcol) @ FIXME:\\ use \file{../../stuff/sim-minProb.R} \\ and \file{../../stuff/minProb-anal.R} Further, it can be seen that the \emph{maximal} dip statistic is $\frac 1 4 = 0.25$ and this upper bound can be reached simply (for even $n$) using the the data $(0,0,\dots,0, \; 1, 1,\dots,1)$, a bi-point mass with equal mass at both points. \section{P-values for the Dip Test} \label{sec:Pvals} Note that it is not obvious how to compute P-values for ``the dip test'', as that means determining the distribution of the test statistic, i.e., $D_n$ under the null hypothesis, but a natural null, $H_o: F \in \{F \mathrm{cadlag} \mid f := \frac d{dx} F is unimodal \}$ is too large. Hartigans'(1985) argued for using the uniform $U[0,1]$ i.e., $F'(x) = f(x)= \mathbf{1}_{[0,1]}(x) = [0 \le x \le 1]$ (Iverson bracket) instead, even though they showed that it is not quite the ``least favorable'' one. Following Hartigans', we will define the P-value of an observed $d_n$ as \begin{equation} \label{eq:Pval} P_{d_n} := \PR{D_n \ge d_n} := \PR{\mathrm{dip}(U_1,\dots,U_n) \ge d_n}, \ \ \mathrm{where} \ U_i \sim U[0,1], \ \, \iid \end{equation} \subsection{Interpolating the Dip Table} \label{sec:interpol} Because of the asymptotic distribution, $ \lim_{n\to\infty}\sqrt{n} \; D_n \isD D_\infty$, it is makes sense to consider the ``$\sqrt{n} D_n$''--scale, even for finite $n$ values: <>= data(qDiptab) dnqd <- dimnames(qDiptab) (nn. <- as.integer(dnqd[["n"]])) matplot(nn., qDiptab*sqrt(nn.), type ="o", pch=1, cex = 0.4, log="x", xlab="n [log scaled]", ylab = expression(sqrt(n) %*% q[D[n]])) ## Note that 1/2n is the first possible value (with finite mass),, ## clearly visible for (very) small n: lines(nn., sqrt(nn.)/(2*nn.), col=adjustcolor("yellow2",0.5), lwd=3) P.p <- as.numeric(print(noquote(dnqd[["Pr"]]))) ## Now look at one well known data set: D <- dip(x <- faithful$waiting) n <- length(x) points(n, sqrt(n)*D, pch=13, cex=2, col= adjustcolor("blue2",.5), lwd=2) ## a simulated (approximate) P-value for D is mean(D <= replicate(10000, dip(runif(n)))) ## ~ 0.002 @ but we can use our table to compute a deterministic (but still approximate, as the table is from simulation too) P-value: <>= ## We are in this interval: n0 <- nn.[i.n <- findInterval(n, nn.)] n1 <- nn.[i.n +1] ; c(n0, n1) f.n <- (n - n0)/(n1 - n0)# in [0, 1] ## Now "find" y-interval: y.0 <- sqrt(n0)* qDiptab[i.n ,] y.1 <- sqrt(n1)* qDiptab[i.n+1,] (Pval <- 1 - approx(y.0 + f.n*(y.1 - y.0), P.p, xout = sqrt(n) * D)[["y"]]) ## 0.018095 @ Finally, in May 2011, after several year of people asking for it, I have implemented a \code{dip.test} function which makes use of a --- somewhat more sophisticated --- interpolation scheme like the one above, to compute a $P$-value. As \code{qDiptab} has been based on $10^6$ samples, the interpolation yields accurate P-values, unless in very extreme cases. Here is the small ($n=63$) example from Hartigan$^2$, <>= data(statfaculty) dip.test(statfaculty) @ where, from a P-value of 8.7\%, we'd conclude that there is not enough evidence against unimodality. \subsection{Asymptotic Dip Distribution} \label{sec:asymp} We have conducted extensive simulations in order to explore the limit distribution of $D_\infty$, i.e., the limit of $\sqrt{n} \; D_n$, (\ref{eq:D.infty}). Our current \R\ code is in \file{ ../../stuff/asymp-distrib.R } but the simulation results (7 Megabytes for each $n$) cannot be assumed to be part of the package, nor maybe even to be simply accessible via the internet. %% is bandwidth a problem ? probably no longer in the near future? %% Maybe \section{Less Conservative Dip Testing} \section{Session Info} <>= toLatex(sessionInfo()) @ \bibliography{diptest} \end{document}