October 18, 2017

R Markdown

This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both content and the output of any embedded R code chunks within the document.

I will show here the R code for ANOVA. The two examples are adapted from external websites.

The Data

We first create the migraine dataset, which records pain scores for three types of drugs. We create a data frame in R to store the data.

pain = c(4, 5, 4, 3, 2, 4, 3, 4, 4, 6, 8, 4, 5, 4, 6, 5, 
         8, 6, 6, 7, 6, 6, 7, 5, 6, 5, 5)
drug = c(rep("A",9), rep("B",9), rep("C",9))
migraine = data.frame(pain,drug)
str(migraine)
## 'data.frame':    27 obs. of  2 variables:
##  $ pain: num  4 5 4 3 2 4 3 4 4 6 ...
##  $ drug: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 2 ...

Descriptive statistics

  • Mean, variance, number of elements in each cell
  • Visualise the data - boxplot; look at distribution, look for outliers
  • We'll use the tapply() function, which is a helpful shortcut for processing data.
  • tapply() lets you specify a response variable, a factor (or factors), and a function to apply to each subset of the response variable defined by the levels of the factor.

Descriptive Statistics

Instead of doing

options(digits=10)
mean(pain[drug=="A"])
## [1] 3.666666667
mean(pain[drug=="B"])
## [1] 5.777777778

We can simply call

tapply(pain, drug, mean)
##           A           B           C 
## 3.666666667 5.777777778 5.888888889

Descriptive Statistics

  • We need the sample variances, i.e. \(s_j^2\) for \(j = 1, 2, 3\).
tapply(pain, drug, var)
##            A            B            C 
## 0.7500000000 2.1944444444 0.6111111111
  • We also need the sample sizes for each group. (Note: it will not always be equal or obvious!)
tapply(pain, drug, length)
## A B C 
## 9 9 9

Visual comparison

We can look at the boxplot for the three groups:

plot(pain ~ drug, data=migraine)

More Visual comparisons

The R packages 'mosaic' and 'gmodels' can create better-looking plots and summaries:

require(mosaic)
require(gmodels)
options(digits=3)
favstats(pain ~ drug, data=migraine)
##   drug min Q1 median Q3 max mean    sd n missing
## 1    A   2  3      4  4   5 3.67 0.866 9       0
## 2    B   4  5      6  6   8 5.78 1.481 9       0
## 3    C   5  5      6  6   7 5.89 0.782 9       0

Histogram for the three groups

trellis.par.set(theme=col.mosaic()) # better color scheme for lattice
histogram(~ pain| drug, fit="normal", layout=c(1, 3), data=migraine)

Density Plots for the three groups

trellis.par.set(theme=col.mosaic()) # better color scheme for lattice
densityplot(~ pain, groups=drug, auto.key=TRUE, data=migraine)

Box and Whisker Plots for the three groups

trellis.par.set(theme=col.mosaic()) # better color scheme for lattice
bwplot(pain ~ drug, data=migraine)

ANOVA Sum of squares

  • We need to calculate two sums of squares for the F-statistic.
  • The numerator is the between-groups variation, \(\sum_{j=1}^{K} n_j (\bar{x}_j - \bar{X})^2\) (\(n_j\): sample size of the \(j^{th}\) group).
  • We have already calculated the \(n_j\)'s and the \(\bar{x}_j\)'s using tapply().
n_j = tapply(pain,drug,length)
SST_j = n_j * (tapply(pain,drug,mean)-mean(pain))^2
SST = sum(SST_j)
cat("SST = ", SST, '\n')
## SST =  28.2

SSE

Variation within the treatments = Sum of squares for Error (SSE): \[ SSE = (n_1 -1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_K -1)s_K^2 \]

SSE_j = (n_j-1)*tapply(pain,drug,var)
SSE = sum(SSE_j)
cat("SSE =", SSE, "\n")
## SSE = 28.4

F-statistic

  • Mean square for treatments = MST = \(\frac{SST}{k-1} = \frac{28.2}{2} = 14.11\)
  • Mean square for errors = MSE = \(\frac{SSE}{n-k} = \frac{28.4}{27-3} = 1.19\).
  • The ratio of the variability among the treatment means to that within the treatment means is an \(F\)-statistic: \[ F_{k-1,n-k} = \frac{MST}{MSE} = \frac{14.11}{1.19} = 11.9 \] with \(k-1 = 2\) numerator and \(n-k = 24\) denominator degrees of freedom.
p.value = 1-pf(11.9,2,24); cat("P value =", p.value)
## P value = 0.000257
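
The F-ratio above can also be assembled end-to-end from the pieces computed earlier; a self-contained sketch (re-entering the data so the chunk runs on its own):

```r
# One-way ANOVA F-statistic by hand for the migraine data
pain <- c(4, 5, 4, 3, 2, 4, 3, 4, 4, 6, 8, 4, 5, 4, 6, 5,
          8, 6, 6, 7, 6, 6, 7, 5, 6, 5, 5)
drug <- c(rep("A", 9), rep("B", 9), rep("C", 9))
n <- length(pain)                  # 27 observations
k <- length(unique(drug))          # 3 groups
n_j <- tapply(pain, drug, length)
SST <- sum(n_j * (tapply(pain, drug, mean) - mean(pain))^2)
SSE <- sum((n_j - 1) * tapply(pain, drug, var))
MST <- SST / (k - 1)               # mean square for treatments
MSE <- SSE / (n - k)               # mean square for error
F.stat <- MST / MSE                # about 11.9, as above
p.value <- pf(F.stat, k - 1, n - k, lower.tail = FALSE)
cat("F =", F.stat, " p =", p.value, "\n")
```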

Reject the null hypothesis. There is a statistically significant difference between the mean pain levels for the three drugs.

ANOVA in R

Of course, we can skip the step-by-step calculation and use R's built-in aov() function:

results <- aov(pain~drug, data=migraine)
summary(results)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## drug         2   28.2   14.11    11.9 0.00026 ***
## Residuals   24   28.4    1.19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise tests

  • The ANOVA test tells you whether there is a significant difference among the means, but it does not tell you which ones are different.
  • The R function pairwise.t.test() compares all possible pairs with corrections for multiple testing.
 pairwise.t.test(pain, drug, p.adjust="bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  pain and drug 
## 
##   A     B    
## B 0.001 -    
## C 7e-04 1.000
## 
## P value adjustment method: bonferroni

A is different from both B and C, but B and C are similar.

Tukey's HSD

Another multiple comparisons procedure is Tukey's method (a.k.a. Tukey's Honest Significant Differences). The function TukeyHSD() creates a set of confidence intervals on the differences between means with the specified family-wise probability of coverage.

results = aov(pain ~ drug, data=migraine)
TukeyHSD(results, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = pain ~ drug, data = migraine)
## 
## $drug
##      diff    lwr  upr p adj
## B-A 2.111  0.830 3.39 0.001
## C-A 2.222  0.941 3.50 0.001
## C-B 0.111 -1.170 1.39 0.975

Another Example

Insect and Sprays

We now look at another dataset, InsectSprays. Six different insect sprays (one categorical variable with 6 levels) were tested to see whether there was a difference in the number of insects found in the field after each spraying (the dependent variable).

data(InsectSprays)
attach(InsectSprays)
str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Descriptive Statistics

tapply(count, spray, mean)
##     A     B     C     D     E     F 
## 14.50 15.33  2.08  4.92  3.50 16.67
tapply(count, spray, var)
##     A     B     C     D     E     F 
## 22.27 18.24  3.90  6.27  3.00 38.61
tapply(count, spray, length)
##  A  B  C  D  E  F 
## 12 12 12 12 12 12

More Visual comparisons

The R packages 'mosaic' and 'gmodels' can create better-looking plots and summaries:

require(mosaic)
require(gmodels)
favstats(count ~ spray, data=InsectSprays)
##   spray min    Q1 median   Q3 max  mean   sd  n missing
## 1     A   7 11.50   14.0 17.8  23 14.50 4.72 12       0
## 2     B   7 12.50   16.5 17.5  21 15.33 4.27 12       0
## 3     C   0  1.00    1.5  3.0   7  2.08 1.98 12       0
## 4     D   2  3.75    5.0  5.0  12  4.92 2.50 12       0
## 5     E   1  2.75    3.0  5.0   6  3.50 1.73 12       0
## 6     F   9 12.50   15.0 22.5  26 16.67 6.21 12       0

Histogram for the six groups

trellis.par.set(theme=col.mosaic()) # better color scheme for lattice
histogram(~ count| spray, fit="normal", layout=c(2, 3), data=InsectSprays)

Density Plots for the six groups

trellis.par.set(theme=col.mosaic()) # better color scheme for lattice
densityplot(~ count, groups=spray, auto.key=TRUE, data=InsectSprays)

Box and Whisker Plots for the six groups

trellis.par.set(theme=col.mosaic()) # better color scheme for lattice
bwplot(count ~ spray, data=InsectSprays)

ANOVA Sum of squares

  • We need to calculate two sums of squares for the F-statistic.
  • The numerator is the between-groups variation, \(\sum_{j=1}^{K} n_j (\bar{x}_j - \bar{X})^2\) (\(n_j\): sample size of the \(j^{th}\) group).
  • We have already calculated the \(n_j\)'s and the \(\bar{x}_j\)'s using tapply().
n_j = tapply(count,spray,length)
SST_j = n_j * (tapply(count,spray,mean)-mean(count))^2
SST = sum(SST_j)
cat("SST = ", SST, '\n')
## SST =  2669

SSE

Variation within the treatments = Sum of squares for Error (SSE): \[ SSE = (n_1 -1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_K -1)s_K^2 \]

SSE_j = (n_j-1)*tapply(count,spray,var)
SSE = sum(SSE_j)
cat("SSE =", SSE, "\n")
## SSE = 1015

F-statistic

  • Mean square for treatments = MST = \(\frac{SST}{k-1} = \frac{2669}{5} = 533.8\)
  • Mean square for errors = MSE = \(\frac{SSE}{n-k} = \frac{1015}{72-6} = 15.378\).
  • The ratio of the variability among the treatment means to that within the treatment means is an \(F\)-statistic: \[ F_{k-1,n-k} = \frac{MST}{MSE} = \frac{533.8}{15.378} = 34.7 \] with \(k-1 = 5\) numerator and \(n-k = 66\) denominator degrees of freedom.
p.value = 1-pf(34.7,5,66);cat("P value =", p.value)
## P value = 0

Reject the null hypothesis. There is a statistically significant difference among the mean insect counts for the six sprays.
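
As a side note, the printed p-value of 0 is a floating-point artifact: 1 - pf(...) underflows once the upper-tail probability drops below machine precision. A sketch of the usual workaround, computing the upper tail directly:

```r
# 1 - pf() loses all precision when pf() is within ~1e-16 of 1;
# lower.tail = FALSE asks for the upper-tail probability directly
p.naive  <- 1 - pf(34.7, 5, 66)
p.direct <- pf(34.7, 5, 66, lower.tail = FALSE)
cat("naive =", p.naive, " direct =", p.direct, "\n")
```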

ANOVA in R

Of course, we can skip the step-by-step calculation and use R's built-in aov() function:

results <- aov(count~spray, data=InsectSprays)
summary(results)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669     534    34.7 <2e-16 ***
## Residuals   66   1015      15                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise tests

  • Pairwise tests tell us which means are different.
  • The R function pairwise.t.test() compares all possible pairs with corrections for multiple testing.
 pairwise.t.test(count, spray, p.adjust="bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  count and spray 
## 
##   A     B     C     D     E    
## B 1     -     -     -     -    
## C 1e-09 1e-10 -     -     -    
## D 1e-06 2e-07 1     -     -    
## E 4e-08 5e-09 1     1     -    
## F 1     1     4e-12 6e-09 2e-10
## 
## P value adjustment method: bonferroni

A, B, and F are similar to one another, and C, D, and E are similar to one another, but each of A, B, and F is different from each of C, D, and E!

Tukey's HSD

results = aov(count ~ spray, data=InsectSprays)
TukeyHSD(results, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##        diff    lwr   upr p adj
## B-A   0.833  -3.87  5.53 0.995
## C-A -12.417 -17.12 -7.72 0.000
## D-A  -9.583 -14.28 -4.88 0.000
## E-A -11.000 -15.70 -6.30 0.000
## F-A   2.167  -2.53  6.87 0.754
## C-B -13.250 -17.95 -8.55 0.000
## D-B -10.417 -15.12 -5.72 0.000
## E-B -11.833 -16.53 -7.13 0.000
## F-B   1.333  -3.37  6.03 0.960
## D-C   2.833  -1.87  7.53 0.492
## E-C   1.417  -3.28  6.12 0.949
## F-C  14.583   9.88 19.28 0.000
## E-D  -1.417  -6.12  3.28 0.949
## F-D  11.750   7.05 16.45 0.000
## F-E  13.167   8.47 17.87 0.000

Assumptions in ANOVA

Conditions required for a Valid ANOVA F-Test:

  1. Completely Randomized Design: The samples are randomly selected in an independent manner from the k treatment populations.
  2. All k sampled populations have distributions that are approximately normal.
  3. The k population variances are equal.

There are ways to check (both visually and numerically) whether these assumptions are satisfied.
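
For example, assumption 2 (approximate normality within each group) can be checked numerically with a Shapiro-Wilk test per group; a sketch using base R and the built-in InsectSprays data:

```r
# Shapiro-Wilk normality test applied within each spray group
data(InsectSprays)
p.norm <- tapply(InsectSprays$count, InsectSprays$spray,
                 function(x) shapiro.test(x)$p.value)
round(p.norm, 3)   # small p-values flag departures from normality
```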

Testing homogeneity of variances

The homogeneity-of-variance test used here is the Bartlett test (no need to remember the details).

bartlett.test(count ~ spray, data=InsectSprays)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  count by spray
## Bartlett's K-squared = 30, df = 5, p-value = 9e-05
  • Note that this implies a significant difference among the group variances, which means we cannot assume that the k population variances are equal.
  • We cannot trust the ANOVA results for this data.

Model checking plots

results = aov(count ~ spray, data=InsectSprays)
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(results, las = 1)      # Residuals, Fitted, ...

par(opar)

Kruskal-Wallis Test

  • When the assumptions for a parametric ANOVA are violated, we use a nonparametric alternative.
  • For one-way ANOVA, the alternative is known as the Kruskal-Wallis test.
# Kruskal Wallis Test One Way Anova by Ranks 
kruskal.test(count ~ spray, data=InsectSprays) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  count by spray
## Kruskal-Wallis chi-squared = 50, df = 5, p-value = 2e-10

Reject the null hypothesis!

First example revisited

Testing homogeneity of variances

bartlett.test(pain ~ drug, data=migraine)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  pain by drug
## Bartlett's K-squared = 4, df = 2, p-value = 0.1
  • Note that although this does not imply a significant difference among the group variances, the P-value is low. We have to be careful!

Model checking plots

results = aov(pain ~ drug, data=migraine)
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(results, las = 1)      # Residuals, Fitted, ...

par(opar)

Kruskal-Wallis Test

  • When the assumptions for a parametric ANOVA are violated, we use a nonparametric alternative.
  • For one-way ANOVA, the alternative is known as the Kruskal-Wallis test.
# Kruskal Wallis Test One Way Anova by Ranks 
kruskal.test(pain ~ drug, data=migraine) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  pain by drug
## Kruskal-Wallis chi-squared = 10, df = 2, p-value = 7e-04

Reject the null hypothesis!

Let's discuss Kruskal-Wallis in more detail!

In class example

We simulate data in three groups by sampling from three different normal distributions with different means but the same variance. The observations for the three groups are randomly sampled from \(N(15, 9^2)\), \(N(25, 9^2)\), and \(N(30, 9^2)\).

## generate the simulation data
set.seed(1238991)
n=5
trt1 = rnorm(n, 15, 9)
trt2 = rnorm(n, 25, 9)
trt3 = rnorm(n, 30, 9)
x = c(trt1, trt2, trt3)
grps = rep(1:3, each=n)

Visualize (Boxplot)
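
The plotting code for this slide is not shown in the source; a minimal base-R sketch, regenerating the simulated data with the same seed so the chunk stands alone:

```r
# Reproduce the simulated data and compare the three groups with boxplots
set.seed(1238991)
n <- 5
x <- c(rnorm(n, 15, 9), rnorm(n, 25, 9), rnorm(n, 30, 9))
grps <- rep(1:3, each = n)
boxplot(x ~ factor(grps), xlab = "group", ylab = "x",
        main = "Simulated normal samples by group")
```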

ANOVA

#ANOVA based on the assumption of normal distribution with equal variance
summary(aov(x ~ factor(grps)))
##              Df Sum Sq Mean Sq F value Pr(>F)  
## factor(grps)  2    402   200.8    3.18  0.078 .
## Residuals    12    758    63.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Kruskal-Wallis Test

options(digits=10)
# Kruskal-Wallis test with chi-square approximation
kruskal.test(x,grps)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  x and grps
## Kruskal-Wallis chi-squared = 6.62, df = 2, p-value = 0.03651617

Now compare the test statistics and the P-values that you have calculated.

Another way

The Kruskal-Wallis test statistic can also be calculated from the output of an ANOVA on the ranks!

## Calculate the ranks of x
rank.x = rank(x)
N = length(x)
summary(aov(rank.x ~ factor(grps)))
##              Df Sum Sq Mean Sq F value   Pr(>F)  
## factor(grps)  2  132.4    66.2 5.38211 0.021457 *
## Residuals    12  147.6    12.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although ANOVA performs a different test, we can recover the value of the Kruskal-Wallis test statistic from its output.

Kruskal-Wallis from ANOVA

Here is the test statistic, recovered from the between-groups sum of squares (SST) of the ranks:

SST = summary(aov(rank.x ~ factor(grps)))[[1]][1,2]
H = SST*12/(N*(N+1))
cat("Kruskal Wallis Test Statistic = ", H, "\n")
## Kruskal Wallis Test Statistic =  6.62
cat("P value = ", 1 -pchisq(H,2))
## P value =  0.03651617375