September 7, 2018

Wilcoxon Test with Ties

  • Will show you warning message !
wilcox.test(c(-2, -2, 2, 2, 5, 5, 5), mu = 3)
## Warning in wilcox.test.default(c(-2, -2, 2, 2, 5, 5, 5), mu = 3): cannot
## compute exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  c(-2, -2, 2, 2, 5, 5, 5)
## V = 12, p-value = 0.7977
## alternative hypothesis: true location is not equal to 3

Suppress Warnings !

  • We can suppress this warning message, but it's not recommended in practice
suppressWarnings(wilcox.test(c(-2,-2,2,2,5,5,5),mu = 3))
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  c(-2, -2, 2, 2, 5, 5, 5)
## V = 12, p-value = 0.7977
## alternative hypothesis: true location is not equal to 3

Another Example

x2 <- c(5.6, 6.1, 6.3, 6.4, 6.5, 6.6, 7.0, 7.5, 7.9, 8.0,
        8.0, 8.1, 8.1, 8.2, 8.4, 8.5, 8.7, 9.4, 14.3, 26.0)
## Plot the density
plot(density(x2), main = "Water Content")

Wilcoxon Signed Rank Test

suppressWarnings(wilcox.test(x2, mu=9, conf.int=TRUE))
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  x2
## V = 41, p-value = 0.01774
## alternative hypothesis: true location is not equal to 9
## 95 percent confidence interval:
##  7.150075 8.500071
## sample estimates:
## (pseudo)median 
##       7.810093

Sign Test

binom.test(sum(x2>9),length(x2),alternative = "two.sided")
## 
##  Exact binomial test
## 
## data:  sum(x2 > 9) and length(x2)
## number of successes = 3, number of trials = 20, p-value = 0.002577
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.03207094 0.37892683
## sample estimates:
## probability of success 
##                   0.15

Parametric t-test

t.test(x2,alternative = "two.sided",mu=9)
## 
##  One Sample t-test
## 
## data:  x2
## t = -0.22154, df = 19, p-value = 0.827
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
##   6.701535 10.858465
## sample estimates:
## mean of x 
##      8.78

Make a nice looking plot

library(ggplot2)
qplot(y=x2, x= 1, geom = "violin")+geom_abline(slope=0,intercept = 9)

Quantiles

The ecdf() function in R

Description

Compute an empirical cumulative distribution function, with several methods for plotting, printing and computing with such an “ecdf” object.

Usage

ecdf(x)

## S3 method for class 'ecdf'
plot(x, ..., ylab="Fn(x)", verticals = FALSE,
     col.01line = "gray70", pch = 19)

## S3 method for class 'ecdf'
print(x, digits= getOption("digits") - 2, ...)

## S3 method for class 'ecdf'
summary(object, ...)
## S3 method for class 'ecdf'
quantile(x, ...)

Arguments

x, object

numeric vector of the observations for ecdf; for the methods, an object inheriting from class "ecdf".

arguments to be passed to subsequent methods, e.g., plot.stepfun for the plot method.

ylab

label for the y-axis.

verticals

see plot.stepfun.

col.01line

numeric or character specifying the color of the horizontal lines at y = 0 and 1, see colors.

pch

plotting character.

digits

number of significant digits to use, see print.

Details

The e.c.d.f. (empirical cumulative distribution function) Fn is a step function with jumps i/n at observation values, where i is the number of tied observations at that value. Missing values are ignored.

For observations x= (x1,x2, … xn), Fn is the fraction of observations less or equal to t, i.e.,

Fn(t) = #{xi <= t}/n = 1/n sum(i=1,n) Indicator(xi <= t).

The function plot.ecdf which implements the plot method for ecdf objects, is implemented via a call to plot.stepfun; see its documentation.

Value

For ecdf, a function of class "ecdf", inheriting from the "stepfun" class, and hence inheriting a knots() method.

For the summary method, a summary of the knots of object with a "header" attribute.

The quantile(obj, …) method computes the same quantiles as quantile(x, …) would where x is the original sample.

Note

The objects of class "ecdf" are not intended to be used for permanent storage and may change structure between versions of R (and did at R 3.0.0). They can usually be re-created by

    eval(attr(old_obj, "call"), environment(old_obj))

since the data used is stored as part of the object's environment.

Author(s)

Martin Maechler; fixes and new features by other R-core members.

See Also

stepfun, the more general class of step functions, approxfun and splinefun.

Examples

##-- Simple didactical  ecdf  example :
x <- rnorm(12)
Fn <- ecdf(x)
Fn     # a *function*
Fn(x)  # returns the percentiles for x
tt <- seq(-2, 2, by = 0.1)
12 * Fn(tt) # Fn is a 'simple' function {with values k/12}
summary(Fn)
##--> see below for graphics
knots(Fn)  # the unique data values {12 of them if there were no ties}

y <- round(rnorm(12), 1); y[3] <- y[1]
Fn12 <- ecdf(y)
Fn12
knots(Fn12) # unique values (always less than 12!)
summary(Fn12)
summary.stepfun(Fn12)

## Advanced: What's inside the function closure?
ls(environment(Fn12))
##[1] "f"  "method"  "n"  "x"  "y"  "yleft"  "yright"
utils::ls.str(environment(Fn12))
stopifnot(all.equal(quantile(Fn12), quantile(y)))

###----------------- Plotting --------------------------
require(graphics)

op <- par(mfrow = c(3, 1), mgp = c(1.5, 0.8, 0), mar =  .1+c(3,3,2,1))

F10 <- ecdf(rnorm(10))
summary(F10)

plot(F10)
plot(F10, verticals = TRUE, do.points = FALSE)

plot(Fn12 , lwd = 2) ; mtext("lwd = 2", adj = 1)
xx <- unique(sort(c(seq(-3, 2, length = 201), knots(Fn12))))
lines(xx, Fn12(xx), col = "blue")
abline(v = knots(Fn12), lty = 2, col = "gray70")

plot(xx, Fn12(xx), type = "o", cex = .1)  #- plot.default {ugly}
plot(Fn12, col.hor = "red", add =  TRUE)  #- plot method
abline(v = knots(Fn12), lty = 2, col = "gray70")
## luxury plot
plot(Fn12, verticals = TRUE, col.points = "blue",
     col.hor = "red", col.vert = "bisque")

##-- this works too (automatic call to  ecdf(.)):
plot.ecdf(rnorm(24))
title("via  simple  plot.ecdf(x)", adj = 1)

par(op)

[Package stats version 3.4.1 Index]

Convergence

Empirical CDF

set.seed(123)
emp1 <- ecdf(rnorm(20)) # E-CDF of 20 samples from N(0,1)
emp2 <- ecdf(rnorm(50)) # E-CDF of 20 samples from N(0,1)

Q-Q Plots (One-sample)

set.seed(12) # Reproducibility
y <- rnorm(100)
qqnorm(y, ylim=c(-3,3), main = "Normal Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
       plot.it = TRUE)
qqline(y, distribution = qnorm)

Two samples same

z <- rnorm(100)
qqplot(y,z,main = "Two-sample Q-Q Plot",
       xlab = "Sample 1 Quantiles", ylab = "Sample 2 Quantiles")
abline(0, 1)

Two samples different

x = rnorm(10000,0,2); y = rnorm(10000,0,4)
qqplot(x,y,main = "Two-sample Q-Q Plot",
       xlab = "Sample 1 Quantiles", ylab = "Sample 2 Quantiles")
abline(0, 1)

A Common Mistake

x = rnorm(10000,0,2); y = rnorm(10000,0,4)
qqplot(x,y,main = "Two-sample Q-Q Plot",
       xlab = "Sample 1 Quantiles", ylab = "Sample 2 Quantiles")

Histograms

hist(x,breaks=50,freq=F,col=rgb(1,0,0,0.5),xlim=c(-15,15))
hist(y,breaks=50,freq=F,col=rgb(0,0,1,0.5),add=T)
box()

Or, Boxplots

boxplot(y,z)

Even better, test a hypothesis

ks.test(rnorm(20),pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(20)
## D = 0.24294, p-value = 0.1593
## alternative hypothesis: two-sided

In our case

ks.test(x,y)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.1691, p-value < 2.2e-16
## alternative hypothesis: two-sided

Next Time

  • We will dig deeper into these tests !
  • Kolmogorov-Smirnov, Lilliefors, Chi-square GoF etc.
  • These are important as they tell you whether your data distribution is Normal, i.e. whether we should use nonparametric methods that do not assume Normality or parametric methods that assume Normality.