2020-11-18

Bootstrap using R

Sampling distribution

  • If the population distribution is known, we can sample from it many times to get an idea of the sampling distribution.

  • Example: \(X_i \sim N(\mu, \sigma^2), i = 1, \ldots, n\).

  • Sampling distribution of \(\bar{X}\) is \(N(\mu, \sigma^2/n)\)

  • Alternatively, we can draw from the normal distribution many times and plot the histogram of all the sample means.

Sampling distribution: Normal mean

set.seed(123)
x = rnorm(100, 2, 3)                      # one observed sample of size n = 100
xbar.smpls = rep(NA, 1000)                # storage for 1000 simulated sample means
for (i in 1:1000)  xbar.smpls[i] = mean(rnorm(100, 2, 3))
hist(xbar.smpls, freq = FALSE)
curve(dnorm(x, mean = 2, sd = 3/10), add = TRUE)   # theoretical N(2, 0.3^2) density

Use of sampling distribution

  • We can use the sampling distribution to get an idea about the fluctuation of the sample mean, etc.
  • The true mean of the sampling distribution should be 2, and the true standard deviation should be \(\sigma/\sqrt{n} = 3/\sqrt{100} = 0.3\).
mean(xbar.smpls)
## [1] 2.00255
sd(xbar.smpls)
## [1] 0.2851543

Bootstrap

  1. If \(F\) is completely unknown, we cannot generate random numbers from this distribution.
  2. \(X_1, \ldots, X_n\) are the only realizations of \(F\) known to us. The idea is to approximate \(F\) by \(\hat F\) based on \(X_1, \ldots, X_n\) (see the illustration after this list).
  3. Basic idea: Inference about a population from sample data (sample \(\rightarrow\) population) can be mimicked by resampling the sample data and drawing inference about the sample from the resampled data (resampled \(\rightarrow\) sample).
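A common choice for \(\hat F\) in the nonparametric bootstrap is the empirical CDF, \(\hat F(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \le x)\), which puts mass \(1/n\) on each observation; drawing from \(\hat F\) is then exactly resampling the data with replacement. A minimal illustration (x.obs is a hypothetical data vector used only for this sketch):

x.obs <- rnorm(30)                        # hypothetical observed data
Fhat <- ecdf(x.obs)                       # empirical CDF: mass 1/n on each X_i
Fhat(0)                                   # estimated P(X <= 0)
x.star <- sample(x.obs, replace = TRUE)   # one draw of size n from Fhat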

Algorithm

  1. Data: \(X_1, \ldots, X_n\). Test statistic \(t_n\) (e.g. \(\bar{X}\)).
  2. Draw \(B\) samples \(x_1^b, x_2^b, \ldots, x_n^b\) with replacement, for \(b = 1, \ldots, B\) (there may be ties). Each such draw is called a bootstrap sample!
  3. For each of the \(B\) samples calculate the statistic \(t_n^b\).
  4. Use these \(B\) realizations of \(t_n\) to compute whatever you want (a minimal sketch follows this list), e.g.
  • mean: \(E(t_n) \approx 1/B \sum_b t_n^b\).
  • variance: \(Var(t_n) \approx 1/(B-1) \sum_b (t_n^b - \bar{t}_n)^2\).
  • Empirical CDF \(\hat{F}(t_n)\).
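A minimal sketch of these steps with \(t_n = \bar{X}\), assuming the data are in a vector x:

B <- 1000
t.boot <- numeric(B)
for (b in 1:B) {
  x.star <- sample(x, replace = TRUE)   # step 2: one bootstrap sample (ties allowed)
  t.boot[b] <- mean(x.star)             # step 3: t_n^b
}
mean(t.boot)         # bootstrap estimate of E(t_n)
var(t.boot)          # bootstrap estimate of Var(t_n)
plot(ecdf(t.boot))   # empirical CDF of the bootstrap replicates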

Bootstrap in R - the sample() function

A major component of bootstrapping is being able to resample a given data set, and in R the function that does this is the sample() function.

sample(x, size, replace, prob)

  • The first argument is a vector containing either the data to be resampled or the indices of the data to be resampled.
  • The size option specifies the sample size, with the default being the full size of the data being resampled (see the quick example after this list).
  • The replace option determines whether the sample is drawn with or without replacement; the default is FALSE, i.e. without replacement.
  • The prob option takes a vector, of length equal to that of the first argument, containing the probability of selection for each element of x.
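For instance, the size argument (not used on the later slides) draws a smaller sample; a quick sketch whose output varies from run to run:

sample(10, size = 3)                   # 3 values from 1:10, without replacement
sample(10, size = 3, replace = TRUE)   # 3 values from 1:10, with replacement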

More on sampling - 1

  • Using sample to generate a permutation of the sequence 1:10
sample(10)
##  [1]  4  1  5 10  3  9  8  6  7  2
  • Bootstrap sample from the same sequence (ties)
sample(10, replace=T)
##  [1]  1  9  5  8  1 10  7  5  5  6
sample(10, replace=T)
##  [1]  5 10 10  1  7  5  6  8  8 10

More on sampling - 2

  • Bootstrap sample from the same sequence, with probabilities that favor the numbers 1-5
prob1 <- c(rep(.15, 5), rep(.05, 5))
prob1
##  [1] 0.15 0.15 0.15 0.15 0.15 0.05 0.05 0.05 0.05 0.05
sample(10, replace=T, prob=prob1)
##  [1] 4 8 4 3 7 5 5 3 1 7

Bootstrap in R

  • Here is R code for obtaining bootstrap samples of the mean from the same \(x\) that we saw before.
B = 5000
theta.boot = rep(0, B)
n = length(x)
for (b in 1:B)  theta.boot[b] = mean(sample(x,replace=TRUE))

Bootstrap mean and sd

  • We can use statistics calculated from the bootstrap distribution to get an idea about the fluctuation of the sample mean, etc., when the true \(F\) is not known.
mean(theta.boot)
## [1] 2.271241
sd(theta.boot)
## [1] 0.2682298

The bootstrap distribution
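The original slide presumably displays a histogram of the bootstrap replicates theta.boot; a minimal sketch of such a plot, with a normal density matched to the bootstrap mean and standard deviation added purely for comparison (the overlay is an assumption, not part of the original figure):

hist(theta.boot, freq = FALSE, main = "Bootstrap distribution of the mean")
curve(dnorm(x, mean = mean(theta.boot), sd = sd(theta.boot)), add = TRUE)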

Another Example - 0

  • Taken entirely from ATS UCLA website
  • In the following bootstrapping example we would like to obtain a standard error for the estimate of the median.
  • We will use the lapply and sapply functions.
  • lapply(X, FUN, ...): returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
  • sapply(X, FUN, ...) is a user-friendly version and wrapper of lapply, by default returning a vector, matrix or, if simplify = "array", an array if appropriate.
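A small illustration of the difference, using a hypothetical toy list:

xs <- list(a = 1:3, b = 4:10)   # hypothetical toy list
lapply(xs, mean)                # returns a list of two means
sapply(xs, mean)                # same means, simplified to a named numeric vector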

Another Example - 1

  • We want to calculate the standard error of the median for a data set created by taking 100 observations from a normal distribution with mean 5 and standard deviation 3.
  • We have rounded each observation to the nearest integer!
  • The distribution of the median is not as "simple" as that of the mean.
data <- round(rnorm(100, 5, 3))
data[1:10]
##  [1]  6  4  9  4  1  4  2 11  9  8

Another Example - 2

  • We will generate only 20 bootstrap samples.
  • Show the first sample for demonstration.
resamples <- lapply(1:20, function(i) sample(data, replace = TRUE))
resamples[1]
## [[1]]
##   [1]  4  4  8  4  1 -2  5  5  4  7  4  6  6 11  1  5  6  5  9  4 10 -1  4  4  4
##  [26] 10  5  7  3  4  4  7  6  3  9  5  6  0  4  4  6  3  1 11  7  4  9  7  7 11
##  [51]  9 12  4  4  3  7  2  4  7  4  5  8  4  0  7  6  4  6  9  6  6  7  5  6 12
##  [76]  4  4  5  7 10  1  7  7 10  4  3  1  5  4  4  0  8  1  8  4  2  5  6 11 11

Another Example - 3

  • For each bootstrap sample, we calculate the median.
  • Now we have a bootstrap sample from the distribution of the sample median, and we can use it to estimate its standard deviation.
r.median <- sapply(resamples, median)
r.median
##  [1] 5.0 5.0 5.0 6.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 4.5 5.0 5.0 5.0 6.0 5.0 5.0 5.0
## [20] 6.0
sqrt(var(r.median))
## [1] 0.3931988

Another Example - 4

hist(r.median, freq = FALSE, col = rgb(0,1,0,0.5))

Using package

# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data 
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
} 
# bootstrapping with 1000 replications 
results <- boot(data=mtcars, statistic=rsq, 
    R=1000, formula=mpg~wt+disp)

Using package

# view results
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ 
##     wt + disp)
## 
## 
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 0.7809306 0.01138209  0.04917415

Using package

You can also get them separately:

(bs.rsq = results$t0)
## [1] 0.7809306
(bs.stderr = sqrt(var(results$t)))
##            [,1]
## [1,] 0.04917415
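The comment at the top of the bootstrapping chunk promises a 95% CI; the boot package's boot.ci() function computes several types of bootstrap interval from the same results object (output omitted here):

# 95% bootstrap confidence intervals for R-squared (percentile and BCa)
boot.ci(results, conf = 0.95, type = c("perc", "bca"))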

Plotting results

plot(results)

Plotting results

df <- data.frame(x = results$t)
library(ggplot2)
ggplot(df, aes(x)) + geom_histogram(aes(y = ..density..), fill = "blue", alpha = 0.5)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Law data example

data(law, package = "bootstrap")
library(ggplot2)
qplot(LSAT, GPA, data = law)

with(law, cor(LSAT, GPA))
## [1] 0.7763745

Sample with replacement

sample(nrow(law), replace = TRUE)
##  [1]  3  1 10  1 14  2 13 14  3  5  3  3  1  8  5
sample(nrow(law), replace = TRUE)
##  [1]  7  6  3 12  8 12  5  7  9  2  3  4  4  4  3
with(law[sample(nrow(law), replace = TRUE), ], cor(LSAT, GPA))
## [1] 0.6162869
with(law[sample(nrow(law), replace = TRUE), ], cor(LSAT, GPA))
## [1] 0.7647155
with(law[sample(nrow(law), replace = TRUE), ], cor(LSAT, GPA))
## [1] 0.8670833

Bootstrap (Long code)

ptm <- proc.time()
B <- 5000
cor.boot <- rep(0,B)
for (b in 1:B){
  law.boot <- law[sample(nrow(law), replace = TRUE), ]
  cor.boot[b] = cor(law.boot$LSAT, law.boot$GPA)
}
summary(cor.boot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00217 0.69120 0.78835 0.76975 0.87130 0.99465
proc.time()-ptm
##    user  system elapsed 
##    0.73    0.00    0.73

Bootstrap (Short code)

ptm <- proc.time()
B <- 5000
cors <- replicate(B, {with(law[sample(nrow(law), replace = TRUE), ], 
                           cor(LSAT, GPA))})
summary(cors)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1565  0.6890  0.7890  0.7707  0.8721  0.9917
proc.time()-ptm
##    user  system elapsed 
##    0.69    0.00    0.69
  • Slight gain in running time! Gets better with bigger data.

Plotting results

df <- data.frame(x = cors)
library(ggplot2)
ggplot(df, aes(x)) + geom_histogram(aes(y = ..density..), fill = "blue", alpha = 0.5)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How many bootstrap samples do we need?

  • We will perform bootstrap sampling for different values of \(B\), the number of bootstrap samples, and observe the fluctuation of the bootstrap estimates for each choice.
  • Our preferred \(B\) should lead to low variation.
  • We will show the standard loop-based code, which is not efficient, and then a faster version.

Design

Bs <- rep(100 * (1:50), 3)
Bs
##   [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400
##  [15] 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800
##  [29] 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200
##  [43] 4300 4400 4500 4600 4700 4800 4900 5000  100  200  300  400  500  600
##  [57]  700  800  900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
##  [71] 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400
##  [85] 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800
##  [99] 4900 5000  100  200  300  400  500  600  700  800  900 1000 1100 1200
## [113] 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600
## [127] 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000
## [141] 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000

How many? (slow code)

ptm <- proc.time()
corsize2 = matrix(0,length(Bs),2)
for (i in 1:length(Bs)){
  b = Bs[i]                      # number of bootstrap samples for this run
  cor.boot = rep(0, b)           # storage for this run's correlations
  for (j in 1:b){
    law.boot <- law[sample(nrow(law), replace = TRUE), ]
    cor.boot[j] = cor(law.boot$LSAT, law.boot$GPA)
  }
  corsize2[i,] = c(b, mean(cor.boot))
}
colnames(corsize2) = c("B","cor")     
proc.time()-ptm
##    user  system elapsed 
##  109.58    0.33  121.28
head(corsize2, n = 3)
##         B       cor
## [1,] 5000 0.7674854
## [2,] 5000 0.7736561
## [3,] 5000 0.7734934

Fast code for Bootstrap sample size

  • The idea is to calculate the mean bootstrap correlation for each value of \(B\) from 100 to 5000, replicated three times.
  • The slow code uses a nested for loop, which is really inefficient.
  • We can use a function called ldply (from the plyr package) that, together with replicate, avoids explicit for loops altogether.
  • ldply function: for each element of a list (or vector), apply a function, then combine the results into a data frame.
  • Our list is the vector of values of \(B\); for each \(B\), we calculate the mean of the correlation estimates from a bootstrap with \(B\) samples.

How many? (Fast Code)

ptm <- proc.time()
library(plyr)
corsize <- ldply(Bs, function(B){
  cors <- replicate(B, {with(law[sample(nrow(law), replace = TRUE), ],
                             cor(LSAT, GPA))})
  c(B = B, cor = mean(cors))
})
head(corsize,n=3)
##     B       cor
## 1 100 0.7676786
## 2 200 0.7787451
## 3 300 0.7638906
proc.time()-ptm
##    user  system elapsed 
##   52.80    0.08   54.36

Plot

  • Plot the correlation estimates for each value of \(B\), for the three replications.
  • As \(B\) increases, the fluctuation diminishes.
qplot(B, cor, data = corsize)