The R codes for the baseball example and the data are taken from the ISLR book by James, Witten, Hastie and Tibshirani.
March 30, 2020
The R codes for the baseball example and the data are taken from the ISLR book by James, Witten, Hastie and Tibshirani.
Major League Baseball Data from the 1986 and 1987 seasons.
library(ISLR) #fix(Hitters) names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks" ## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI" ## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors" ## [19] "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
Salary is missing for some players. The na.omit() function removes all of the rows that have missing values in any variable.
sum(is.na(Hitters$Salary))
## [1] 59
Hitters=na.omit(Hitters) dim(Hitters)
## [1] 263 20
library(leaps) regfit.full=regsubsets(Salary~.,Hitters) #summary(regfit.full)
The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS.
By default, regsubsets() reports up to the best eight-variable model, which we can change using the nvmax argument.
regfit.full=regsubsets(Salary~.,data=Hitters,nvmax=19) reg.summary=summary(regfit.full)
The summary() function also returns \(R^2\), RSS, adjusted \(R^2\), \(C_p\), and BIC. We can examine these to try to select the best overall model.
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227 ## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164 ## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
The \(R^2\) values increase monotonically !
Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once will help us decide which model to select:
par(mfrow=c(2,2)) plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l") plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l") plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l') plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l') plot(regfit.full,scale="r2")
Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once will help us decide which model to select:
The functions which.max() and which.min() helps us select the maximum point of a vector.
which.max(reg.summary$adjr2)
## [1] 11
which.min(reg.summary$cp)
## [1] 10
which.min(reg.summary$bic)
## [1] 6
regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward") fwd.summary <- summary(regfit.fwd)
which.max(fwd.summary$adjr2)
## [1] 11
which.min(fwd.summary$cp)
## [1] 10
which.min(fwd.summary$bic)
## [1] 6
regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward") bwd.summary <- summary(regfit.bwd)
which.max(bwd.summary$adjr2)
## [1] 11
which.min(bwd.summary$cp)
## [1] 10
which.min(bwd.summary$bic)
## [1] 8
coef(regfit.full,7)
## (Intercept) Hits Walks CAtBat CHits CHmRun ## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538 ## DivisionW PutOuts ## -129.9866432 0.2366813
coef(regfit.fwd,7)
## (Intercept) AtBat Hits Walks CRBI CWalks ## 109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070 ## DivisionW PutOuts ## -127.1223928 0.2533404
coef(regfit.bwd,7)
## (Intercept) AtBat Hits Walks CRuns CWalks ## 105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346 ## DivisionW PutOuts ## -116.1692169 0.3028847
Ridge, Lasso, Elastic Net are all based on R package glmnet.
Elastic Net Penalty: \[ P_{\lambda,\alpha}(\beta) = \frac{(1-\alpha)}{2} {\lVert \beta \rVert}_2^2 + \alpha {\lVert \beta \rVert}_1^2 \]
\(\alpha = 0\) is Ridge, \(\alpha = 1\) is Lasso, anything in between is Elastic Net.
x=model.matrix(Salary~.,Hitters)[,-1] y=Hitters$Salary dim(x)
## [1] 263 19
dim(y)
## NULL
library(glmnet) grid=10^seq(10,-2,length=100) ridge.mod=glmnet(x,y,alpha=0,lambda=grid) dim(coef(ridge.mod))
## [1] 20 100
When \(\lambda\) = 1.14975710^{4} :
ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs ## 407.356050200 0.036957182 0.138180344 0.524629976 0.230701523 ## RBI Walks Years CAtBat CHits ## 0.239841459 0.289618741 1.107702929 0.003131815 0.011653637 ## CHmRun CRuns CRBI CWalks LeagueN ## 0.087545670 0.023379882 0.024138320 0.025015421 0.085028114 ## DivisionW PutOuts Assists Errors NewLeagueN ## -6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612
When \(\lambda\) = 705.4802311 :
ridge.mod$lambda[60]
## [1] 705.4802
coef(ridge.mod)[,60]
## (Intercept) AtBat Hits HmRun Runs RBI ## 54.32519950 0.11211115 0.65622409 1.17980910 0.93769713 0.84718546 ## Walks Years CAtBat CHits CHmRun CRuns ## 1.31987948 2.59640425 0.01083413 0.04674557 0.33777318 0.09355528 ## CRBI CWalks LeagueN DivisionW PutOuts Assists ## 0.09780402 0.07189612 13.68370191 -54.65877750 0.11852289 0.01606037 ## Errors NewLeagueN ## -0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11001