- Music and Wine as well Pet ownership examples taken from an old introductory Stat class at Purdue.
- The shopping example is taken from the tutorials by William B. King:
November 9, 2018
row1 = c(30,39,30) row2 = c(11,1,19) row3 = c(45,35,35) music.wine = rbind(row1,row2,row3) dimnames(music.wine) = list(wine = c("french","italian","other"), music=c("None","french","italian")) music.wine
## music ## wine None french italian ## french 30 39 30 ## italian 11 1 19 ## other 45 35 35
addmargins(music.wine)
## music ## wine None french italian Sum ## french 30 39 30 99 ## italian 11 1 19 31 ## other 45 35 35 115 ## Sum 86 75 84 245
prop.table(music.wine, 1)
## music ## wine None french italian ## french 0.3030303 0.39393939 0.3030303 ## italian 0.3548387 0.03225806 0.6129032 ## other 0.3913043 0.30434783 0.3043478
barplot(music.wine, beside = T)
barplot(t(music.wine), beside=T, legend=T, ylim=c(0,100), ylab="Observed frequencies in sample", main="Frequency of Wine Purchase by Music")
chisq.test(music.wine)
## ## Pearson's Chi-squared test ## ## data: music.wine ## X-squared = 18.822, df = 4, p-value = 0.0008519
row1 = c(28,50) row2 = c(11,3) pet = rbind(row1,row2) dimnames(pet) = list(survival = c("alive","dead"), petownership=c("No","Yes")) addmargins(pet)
## petownership ## survival No Yes Sum ## alive 28 50 78 ## dead 11 3 14 ## Sum 39 53 92
prop.table(pet)
## petownership ## survival No Yes ## alive 0.3043478 0.5434783 ## dead 0.1195652 0.0326087
chisq.test(pet)
## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: pet ## X-squared = 7.1899, df = 1, p-value = 0.007331
R applies the Yates' continuity correction automatically for 2x2 table. We can set it off, but we shouldn't.
chisq.test(pet, correct = F)
## ## Pearson's Chi-squared test ## ## data: pet ## X-squared = 8.8511, df = 1, p-value = 0.002929
We have enough evidence to say that there is an association between pet ownership and patient status in the population.
It is appropriate to use the chi-square test because none of the cells have expected counts < 5.
chisq.test(pet)$expected
## petownership ## survival No Yes ## alive 33.065217 44.934783 ## dead 5.934783 8.065217
The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping. These data were found in Verzani (2005, p.262f), and the analysis here is similar to his.
outcome ----------------- treat worse same better --------------------------- Celexa 2 3 7 placebo 2 8 2 ---------------------------
freqs = c(2, 2, 3, 8, 7, 2) # entered "down the columns" shopping = matrix(freqs, nrow=2) #fills down columns by default dimnames(shopping) = list(treatment=c("Celexa","placebo"), outcome=c("worse","same","better")) addmargins(prop.table(shopping))
## outcome ## treatment worse same better Sum ## Celexa 0.08333333 0.1250000 0.29166667 0.5 ## placebo 0.08333333 0.3333333 0.08333333 0.5 ## Sum 0.16666667 0.4583333 0.37500000 1.0
chisq.test(shopping)
## Warning in chisq.test(shopping): Chi-squared approximation may be incorrect
## ## Pearson's Chi-squared test ## ## data: shopping ## X-squared = 5.0505, df = 2, p-value = 0.08004
chisq.test(shopping)$expected
## Warning in chisq.test(shopping): Chi-squared approximation may be incorrect
## outcome ## treatment worse same better ## Celexa 2 5.5 4.5 ## placebo 2 5.5 4.5
chisq.test(shopping, simulate.p.value=T)
## ## Pearson's Chi-squared test with simulated p-value (based on 2000 ## replicates) ## ## data: shopping ## X-squared = 5.0505, df = NA, p-value = 0.1104
fisher.test(shopping)
## ## Fisher's Exact Test for Count Data ## ## data: shopping ## p-value = 0.07303 ## alternative hypothesis: two.sided
Streissguth et al. (1984) investigated the effect of alcohol and nicotine consumption during pregnancy on the resulting children by examining the children's attention span and reaction time at age four. First, the 452 mothers in the study were classified as shown in Table 2.1 according to their levels of consumption of alcohol and nicotine. Test the null hypothesis of no association between levels of consumption.
## nicotine ## alcohol None 1-15 16 or more ## none 105 7 11 ## 0.01-0.1 58 5 13 ## 0.11-0.99 84 37 42 ## 1.00 or more 37 16 17
chisq.test(alcohol, correct = F)
## ## Pearson's Chi-squared test ## ## data: alcohol ## X-squared = 46.74, df = 6, p-value = 2.109e-08
None of the cells have expected counts < 5.
chisq.test(alcohol)$expected
## nicotine ## alcohol None 1-15 16 or more ## none 80.86111 18.50694 23.63194 ## 0.01-0.1 49.96296 11.43519 14.60185 ## 0.11-0.99 107.15741 24.52546 31.31713 ## 1.00 or more 46.01852 10.53241 13.44907
In R, use the package "vcd", function assocstats()
In case of a 2-dimensional table, a list with components:
chisq_tests | a 2 x 3 table with the chi-squared statistics |
---|---|
phi | The absolute value of the phi coefficient (only defined for 2 x 2 tables) |
cont | The contingency coefficient |
cramer | Cramer's V |
#install.packages("vcd") library(vcd)
## Loading required package: grid
assocstats(music.wine)
## X^2 df P(> X^2) ## Likelihood Ratio 22.285 4 0.00017585 ## Pearson 18.822 4 0.00085186 ## ## Phi-Coefficient : NA ## Contingency Coeff.: 0.267 ## Cramer's V : 0.196
#install.packages("vcd") library(vcd) assocstats(pet)
## X^2 df P(> X^2) ## Likelihood Ratio 9.0113 1 0.0026832 ## Pearson 8.8511 1 0.0029291 ## ## Phi-Coefficient : 0.31 ## Contingency Coeff.: 0.296 ## Cramer's V : 0.31
R gives us \(\phi\) because it is a 2x2 table.
#install.packages("vcd") library(vcd) assocstats(shopping)
## X^2 df P(> X^2) ## Likelihood Ratio 5.3002 2 0.070644 ## Pearson 5.0505 2 0.080038 ## ## Phi-Coefficient : NA ## Contingency Coeff.: 0.417 ## Cramer's V : 0.459
#install.packages("vcd") library(vcd) assocstats(alcohol)
## X^2 df P(> X^2) ## Likelihood Ratio 49.816 6 5.1184e-09 ## Pearson 46.740 6 2.1088e-08 ## ## Phi-Coefficient : NA ## Contingency Coeff.: 0.312 ## Cramer's V : 0.233
Some of the best examples of contingency tables come from Political data analysis.
Example: Analyzing Exit Poll Data from CNN. Total number of respondents = 24537.
Party | 18-24 | 25-29 | 30-39 | 40-49 | 50-64 | 65 and older |
---|---|---|---|---|---|---|
Clinton | 56% | 53% | 51% | 46% | 44% | 45% |
Trump | 35% | 39% | 40% | 50% | 53% | 53% |
clinton = c(1374,1170,2127,2145,3239,1656) trump = c(859,861,1669,2331,3901,1951) elect16= rbind(clinton,trump) dimnames(elect16) = list(candidate = c("clinton","trump"), agegp = c("18-24","25-29","30-39","40-49","50-64",">65")) barplot(elect16, beside=T, legend=T)
chisq.test(elect16)
## ## Pearson's Chi-squared test ## ## data: elect16 ## X-squared = 313.46, df = 5, p-value < 2.2e-16
Party | clinton | trump | others |
---|---|---|---|
male | 41% | 53% | 6% |
female | 54% | 42% | 4% |
gender <- matrix(c(4829,6242,707,6890,5359,510), byrow = T, ncol = 3) dimnames(gender) = list(gender = c("male","female"), candidate = c("clinton","trump","others")) barplot(t(gender), beside=T, legend=T)
gender <- matrix(c(4829,6242,707,6890,5359,510), byrow = T, ncol = 3) dimnames(gender) = list(gender = c("male","female"), candidate = c("clinton","trump","others")) prop.table(gender, 1)
## candidate ## gender clinton trump others ## male 0.4100017 0.5299711 0.06002717 ## female 0.5400110 0.4200172 0.03997178
chisq.test(gender)
## ## Pearson's Chi-squared test ## ## data: gender ## X-squared = 423.02, df = 2, p-value < 2.2e-16
tea<- matrix(c(3, 1, 1, 3), ncol= 2, dimnames = list(Truth = c("Tea","Milk" ), Lady_says = c("Tea first","Milk first"))) tea
## Lady_says ## Truth Tea first Milk first ## Tea 3 1 ## Milk 1 3
fisher.test(tea)
## ## Fisher's Exact Test for Count Data ## ## data: tea ## p-value = 0.4857 ## alternative hypothesis: true odds ratio is not equal to 1 ## 95 percent confidence interval: ## 0.2117329 621.9337505 ## sample estimates: ## odds ratio ## 6.408309
fisher.test(tea, alternative = "greater")
## ## Fisher's Exact Test for Count Data ## ## data: tea ## p-value = 0.2429 ## alternative hypothesis: true odds ratio is greater than 1 ## 95 percent confidence interval: ## 0.3135693 Inf ## sample estimates: ## odds ratio ## 6.408309
The data in table 1 come from an RCT comparing intramuscular magnesium injections with placebo for the treatment of chronic fatigue syndrome Of the 15 patients who had the intra-muscular magnesium injections 12 felt better (80 per cent) whereas, of the 17 on placebo, only three felt better (18 per cent).
Magnesium | Placebo | |
---|---|---|
Felt Better |
12 |
3 |
Fel worse |
3 |
14 |
RCT <- matrix(c(12,3,3,14), ncol= 2, dimnames = list(Outcome = c("Better", "Worse" ), Medicine = c("Magnesium", "Placebo"))) chisq.test(RCT)
## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: RCT ## X-squared = 10.063, df = 1, p-value = 0.001512
chisq.test(RCT)$expected
## Medicine ## Outcome Magnesium Placebo ## Better 7.03125 7.96875 ## Worse 7.96875 9.03125
The expected counts are all more than 5 so a chi-sqaure approximation would be valid here, but, we can still do Fisher's exact test to get the exact P-value.
fisher.test(RCT)
## ## Fisher's Exact Test for Count Data ## ## data: RCT ## p-value = 0.001033 ## alternative hypothesis: true odds ratio is not equal to 1 ## 95 percent confidence interval: ## 2.507119 159.407427 ## sample estimates: ## odds ratio ## 16.41917
cold <- matrix(c(31,109,17,122), ncol= 2, dimnames = list(Outcome = c("Placebo", "Ascorbic Acid" ), Medicine = c("Cold", "No Cold"))) cold
## Medicine ## Outcome Cold No Cold ## Placebo 31 17 ## Ascorbic Acid 109 122
cold <- matrix(c(31,109,17,122), ncol= 2, dimnames = list(Outcome = c("Placebo", "Ascorbic Acid" ), Medicine = c("Cold", "No Cold"))) chisq.test(cold)
## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: cold ## X-squared = 4.1407, df = 1, p-value = 0.04186
cold <- matrix(c(310,1090,170,1220), ncol= 2, dimnames = list(Outcome = c("Placebo", "Ascorbic Acid" ), Medicine = c("Cold", "No Cold"))) chisq.test(cold)
## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: cold ## X-squared = 47.421, df = 1, p-value = 5.727e-12
cold <- matrix(c(3,11, 2,12), ncol= 2, dimnames = list(Outcome = c("Placebo", "Ascorbic Acid" ), Medicine = c("Cold", "No Cold"))) chisq.test(cold)
## Warning in chisq.test(cold): Chi-squared approximation may be incorrect
## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: cold ## X-squared = 0, df = 1, p-value = 1
If odds < 1, then "success" is less likely than "failure".
Let "C" = cold, "NC" = no cold, "Pl" = Placebo, "VitC" = Vitamin C.
The odds ratio, is the ratio of odds1 and odds2: \[ OR = \frac{P( Z = C | Y = Pl) \times P(Z = NC | Y = VitC)}{P(Z = NC | Y = Pl)\times P( Z = C | Y = VitC)} \]
Z = 1 | Z = 2 | |
---|---|---|
Y = 1 |
\(n_{11}\),\(\pi_{11}\) |
\(n_{12}\),\(\pi_{12}\) |
Y = 2 |
\(n_{12}\),\(\pi_{12}\) |
\(n_{22}\),\(\pi_{22}\) |
\[ \theta = \mbox{Odds ratio} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} \] Natural estimate: \[ \hat{\theta} = \mbox{Estimated Odds ratio} = \frac{n_{11}n_{22}}{n_{12}n_{21}} \]
cold <- matrix(c(31,109,17,122), ncol= 2, dimnames = list(Outcome = c("Placebo", "Ascorbic Acid" ), Medicine = c("Cold", "No Cold"))) OR = cold[1,1]*cold[2,2]/(cold[1,2]*cold[2,1]); cat(OR)
## 2.041015
cold <- matrix(c(310,1090,170,1220), ncol= 2, dimnames = list(Outcome = c("Placebo", "Ascorbic Acid" ), Medicine = c("Cold", "No Cold"))) OR = cold[1,1]*cold[2,2]/(cold[1,2]*cold[2,1]); cat(OR)
## 2.041015
## Medicine ## Outcome Cold No Cold ## Placebo 31 17 ## Ascorbic Acid 109 122
OR = cold[1,1]*cold[2,2]/(cold[1,2]*cold[2,1]); cat("Odds ratio", OR)
## Odds ratio 2.041015
For our example, \(OR = 2.04\) means that:
sales.data <- matrix(c(5, 13, 4, 6), ncol= 2, dimnames = list(Before = c("Acceptable","Unacceptable" ), After = c("Acceptable","Unacceptable" ))) sales.data
## After ## Before Acceptable Unacceptable ## Acceptable 5 4 ## Unacceptable 13 6
mcnemar.test(sales.data , correct = FALSE)
## ## McNemar's Chi-squared test ## ## data: sales.data ## McNemar's chi-squared = 4.7647, df = 1, p-value = 0.02905
# install.packages("exact2x2") library(exact2x2) mcnemar.exact(sales.data)
## ## Exact McNemar test (with central confidence intervals) ## ## data: sales.data ## b = 4, c = 13, p-value = 0.04904 ## alternative hypothesis: true odds ratio is not equal to 1 ## 95 percent confidence interval: ## 0.07308542 0.99598118 ## sample estimates: ## odds ratio ## 0.3076923