Modern Regression 1 Subset Selection & Ridge

March 30, 2020

Source

The R codes for the baseball example and the data are taken from the ISLR book by James, Witten, Hastie and Tibshirani.

Baseball data

Major League Baseball Data from the 1986 and 1987 seasons.

library(ISLR)
#fix(Hitters)
names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"

dim(Hitters)

## [1] 322  20

Remove NA’s

Salary is missing for some players. The na.omit() function removes all of the rows that have missing values in any variable.

sum(is.na(Hitters$Salary))

## [1] 59

Hitters=na.omit(Hitters)
dim(Hitters)

## [1] 263  20

Best Subset Selection

library(leaps)
regfit.full=regsubsets(Salary~.,Hitters)
#summary(regfit.full)

The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS.

Best Subset Selection

By default, regsubsets() reports up to the best eight-variable model, which we can change using the nvmax argument.

regfit.full=regsubsets(Salary~.,data=Hitters,nvmax=19)
reg.summary=summary(regfit.full)

Best Subset Selection

The summary() function also returns \(R^2\), RSS, adjusted \(R^2\), \(C_p\), and BIC. We can examine these to try to select the best overall model.

names(reg.summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

reg.summary$rsq

##  [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
##  [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159

The \(R^2\) values increase monotonically !

Choosing the best model

Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once will help us decide which model to select:

par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
plot(regfit.full,scale="r2")

Choosing the best model

Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once will help us decide which model to select:

Choosing the best model

The functions which.max() and which.min() helps us select the maximum point of a vector.

which.max(reg.summary$adjr2)

## [1] 11

which.min(reg.summary$cp)

## [1] 10

which.min(reg.summary$bic)

## [1] 6

Now plot everything together

Forward

regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
fwd.summary <- summary(regfit.fwd)

Plots

Best model

which.max(fwd.summary$adjr2)

## [1] 11

which.min(fwd.summary$cp)

## [1] 10

which.min(fwd.summary$bic)

## [1] 6

Now plot everything together

Backward

regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward")
bwd.summary <- summary(regfit.bwd)

Best model

which.max(bwd.summary$adjr2)

## [1] 11

which.min(bwd.summary$cp)

## [1] 10

which.min(bwd.summary$bic)

## [1] 8

Now plot everything together

Different Models

coef(regfit.full,7)

##  (Intercept)         Hits        Walks       CAtBat        CHits       CHmRun 
##   79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073    1.4420538 
##    DivisionW      PutOuts 
## -129.9866432    0.2366813

coef(regfit.fwd,7)

##  (Intercept)        AtBat         Hits        Walks         CRBI       CWalks 
##  109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622   -0.3053070 
##    DivisionW      PutOuts 
## -127.1223928    0.2533404

coef(regfit.bwd,7)

##  (Intercept)        AtBat         Hits        Walks        CRuns       CWalks 
##  105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095   -0.7163346 
##    DivisionW      PutOuts 
## -116.1692169    0.3028847

Ridge Regression

Ridge, Lasso, Elastic Net are all based on R package glmnet.
Elastic Net Penalty: \[ P_{\lambda,\alpha}(\beta) = \frac{(1-\alpha)}{2} {\lVert \beta \rVert}_2^2 + \alpha {\lVert \beta \rVert}_1^2 \]
\(\alpha = 0\) is Ridge, \(\alpha = 1\) is Lasso, anything in between is Elastic Net.

R commands

x=model.matrix(Salary~.,Hitters)[,-1]
y=Hitters$Salary
dim(x)

## [1] 263  19

dim(y)

## NULL

library(glmnet)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))

## [1]  20 100

Solutions

When \(\lambda\) = 1.14975710^{4} :

ridge.mod$lambda[50]

## [1] 11497.57

coef(ridge.mod)[,50]

##   (Intercept)         AtBat          Hits         HmRun          Runs 
## 407.356050200   0.036957182   0.138180344   0.524629976   0.230701523 
##           RBI         Walks         Years        CAtBat         CHits 
##   0.239841459   0.289618741   1.107702929   0.003131815   0.011653637 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##   0.087545670   0.023379882   0.024138320   0.025015421   0.085028114 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
##  -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531

sqrt(sum(coef(ridge.mod)[-1,50]^2))

## [1] 6.360612

Solutions

When \(\lambda\) = 705.4802311 :

ridge.mod$lambda[60]

## [1] 705.4802

coef(ridge.mod)[,60]

##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
##  54.32519950   0.11211115   0.65622409   1.17980910   0.93769713   0.84718546 
##        Walks        Years       CAtBat        CHits       CHmRun        CRuns 
##   1.31987948   2.59640425   0.01083413   0.04674557   0.33777318   0.09355528 
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
##   0.09780402   0.07189612  13.68370191 -54.65877750   0.11852289   0.01606037 
##       Errors   NewLeagueN 
##  -0.70358655   8.61181213

sqrt(sum(coef(ridge.mod)[-1,60]^2))

## [1] 57.11001