April 1, 2020

Source

The R code for the baseball example and the data are taken from the ISLR book by James, Witten, Hastie, and Tibshirani.

This is a low-dimensional example, which means \(p < n\). We will see how LASSO and Ridge compare with OLS on this data set.

We will see a high-dimensional example after this. Of course, there is no unique OLS solution for high-dimensional data.

Baseball data

Major League Baseball Data from the 1986 and 1987 seasons.

library(ISLR)
#fix(Hitters)
names(Hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"
dim(Hitters)
## [1] 322  20

Remove NA’s

Salary is missing for some players. The na.omit() function removes all of the rows that have missing values in any variable.

sum(is.na(Hitters$Salary))
## [1] 59
Hitters=na.omit(Hitters)
dim(Hitters)
## [1] 263  20

Ridge Regression

  • Ridge, Lasso, and Elastic Net are all implemented in the R package glmnet.

  • Elastic Net Penalty: \[ P_{\lambda,\alpha}(\beta) = \lambda \left \{ \frac{(1-\alpha)}{2} {\lVert \beta \rVert}_2^2 + \alpha {\lVert \beta \rVert}_1 \right \} \]

  • \(\alpha = 0\) is Ridge, \(\alpha = 1\) is Lasso, and anything in between (\(0 < \alpha < 1\)) is called Elastic Net.

  • \(\lambda = 0\) is ordinary least squares.

  • ‘Elastic Net’ is a mix of Lasso and Ridge; it tends to work better when the predictor variables come in ‘groups’ of correlated variables.

  • All three methods can be fit using the glmnet package; a brief sketch follows.
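
As a quick illustration (a sketch on simulated data, not part of the baseball example; Xsim, ysim, and the alpha = 0.5 mix are arbitrary choices), the same glmnet() call covers all three penalties through its alpha argument:

library(glmnet)
set.seed(1)
Xsim=matrix(rnorm(100*5),100,5)              # toy predictors
ysim=drop(Xsim%*%c(2,-1,0,0,0))+rnorm(100)   # only the first two matter
fit.ridge=glmnet(Xsim,ysim,alpha=0)          # ridge
fit.lasso=glmnet(Xsim,ysim,alpha=1)          # lasso
fit.enet=glmnet(Xsim,ysim,alpha=0.5)         # elastic net (one choice of mix)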

R commands

x=model.matrix(Salary~.,Hitters)[,-1]
y=Hitters$Salary
dim(x)
## [1] 263  19
length(y)
## [1] 263

Fitting Ridge

  • We’ll specify \(\alpha = 0\) and calculate the ridge solution for \(100\) different values of \(\lambda\), chosen by us.
library(glmnet)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))
## [1]  20 100

Dependence on \(\lambda\)

When \(\lambda = 8.1113083 \times 10^{8}\):

ridge.mod$lambda[10]
## [1] 811130831
coef(ridge.mod)[,10]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##  5.359236e+02  6.710939e-07  2.434358e-06  9.809139e-06  4.116682e-06 
##           RBI         Walks         Years        CAtBat         CHits 
##  4.348509e-06  5.117929e-06  2.093010e-05  5.761989e-08  2.120577e-07 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##  1.599207e-06  4.254349e-07  4.390581e-07  4.645197e-07 -7.150687e-06 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -9.625147e-05  2.687955e-07  4.390401e-08 -2.047105e-07 -1.420491e-06
sqrt(sum(coef(ridge.mod)[-1,10]^2))
## [1] 9.961687e-05

Dependence on \(\lambda\)

When \(\lambda = 1.149757 \times 10^{4}\):

ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
## 407.356050200   0.036957182   0.138180344   0.524629976   0.230701523 
##           RBI         Walks         Years        CAtBat         CHits 
##   0.239841459   0.289618741   1.107702929   0.003131815   0.011653637 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##   0.087545670   0.023379882   0.024138320   0.025015421   0.085028114 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
##  -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612

Dependence on \(\lambda\)

When \(\lambda = 0.01\):

ridge.mod$lambda[100]
## [1] 0.01
coef(ridge.mod)[,100]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##  164.11321606   -1.97386151    7.37772270    3.93660219   -2.19873625 
##           RBI         Walks         Years        CAtBat         CHits 
##   -0.91623008    6.20037718   -3.71403424   -0.17510063    0.21132772 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##    0.05629004    1.36605490    0.70965516   -0.79582173   63.40493257 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -117.08243713    0.28202541    0.37318482   -3.42400281  -25.99081928
sqrt(sum(coef(ridge.mod)[-1,100]^2))
## [1] 136.2012
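
A compact way to see this dependence (a sketch, not in the original slides) is to compute \(\lVert \hat{\beta}_\lambda \rVert_2\) at every \(\lambda\) on the grid; the norm grows steadily as \(\lambda\) shrinks:

B=as.matrix(coef(ridge.mod))[-1,]             # 19 x 100 matrix of coefficients
l2.norm=apply(B,2,function(b) sqrt(sum(b^2)))
plot(log10(ridge.mod$lambda),l2.norm,type="l",
     xlab="log10(lambda)",ylab="l2 norm of ridge coefficients")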

Solution path

Choosing the best solution

  • We will use cross-validation to choose the best \(\lambda\) that minimizes the test error.

  • We have discussed LOO-CV and \(k\)-fold CV in class. In general, LOO-CV has higher variance, while \(k\)-fold CV is more stable (see the sketch after this list).

  • We’ll first use a single validation set and then show CV.

  • Also, note that if you are splitting randomly, you should fix the seed to reproduce your results.
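
Both flavors of CV are available through cv.glmnet() via its nfolds argument; setting nfolds equal to the number of observations gives LOO-CV. A minimal sketch (assuming the x and y built earlier; the object names are illustrative):

set.seed(1)
cv.k10=cv.glmnet(x,y,alpha=0,nfolds=10)                      # 10-fold (the default)
cv.loo=cv.glmnet(x,y,alpha=0,nfolds=nrow(x),grouped=FALSE)   # leave-one-out
c(tenfold=cv.k10$lambda.min,loo=cv.loo$lambda.min)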

Ridge with CV

  • Now split the samples into a training set and a test set in order to estimate the test error of ridge regression.

  • Two ways to split:

    1. Produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data (sketched after this list).

    2. Randomly choose a subset of numbers between 1 and n - these can then be used as the indices for the training observations.
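
A sketch of the first option (the slides below use the second one; the object names here are illustrative):

set.seed(1)
train.flag=sample(c(TRUE,FALSE),nrow(x),replace=TRUE)   # random TRUE/FALSE vector
x.tr=x[train.flag,];  y.tr=y[train.flag]                # training rows
x.te=x[!train.flag,]; y.te=y[!train.flag]               # test rows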

Training and Test

set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
dim(x[train,])
## [1] 131  19
dim(x[test,])
## [1] 132  19

Fit on training

  • Next we fit a ridge regression model on the training set, and evaluate its MSE on the test set, using \(\lambda = 4\).
  • The predict() function: here we get predictions for a test set by supplying the newx argument instead of type="coefficients".
  • Specify s = 4 to fix \(\lambda = 4\).
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh=1e-12)
ridge.pred=predict(ridge.mod,s=4,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 142199.2

Fit

The test MSE is \(1.4219915 \times 10^{5}\).

If we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations:

mean((mean(y[train])-y.test)^2)
## [1] 224669.9

Another way

  • We could also get the same result by fitting a ridge regression model with a very large value of \(\lambda\). Note that s=1e10 means \(\lambda = 10^{10}\).
ridge.pred=predict(ridge.mod,s=1e10,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 224669.8
  • So fitting a ridge regression model with \(\lambda = 4\) leads to a much lower test MSE than fitting a model with just an intercept.

Obtaining least squares

  • Is it better than just performing least squares regression?
  • Ridge regression reduces to OLS for \(\lambda = 0\).
ridge.pred=predict(ridge.mod,s=0,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 167789.8
  • The ridge regression test MSE at \(\lambda = 4\) was 142199.2; OLS (167789.8) is worse.
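
A side note (a sketch, not in the original slides): \(s = 0\) is not on the lambda grid used to fit ridge.mod, so predict() interpolates between grid values; an exact refit at \(\lambda = 0\) can be requested with exact = TRUE (recent glmnet versions also require passing the original training data):

ridge.pred0=predict(ridge.mod,s=0,newx=x[test,],exact=TRUE,
                    x=x[train,],y=y[train])   # exact refit at lambda = 0
mean((ridge.pred0-y.test)^2)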

CV to choose \(\lambda\)

  • In general, instead of arbitrarily choosing \(\lambda = 4\), it would be better to use cross-validation to choose the tuning parameter.

  • We can do this using the built-in cross-validation function, cv.glmnet().

  • By default, the function cv.glmnet() performs ten-fold cross-validation, though this can be changed using the argument nfolds.

Best lambda

set.seed(1) 
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)

Best lambda

bestlam=cv.out$lambda.min
bestlam
## [1] 326.0828

The value of \(\lambda\) that yields the lowest CV error is 326.0828, very different from 4.

What happens to the MSE?

We should calculate the test error at this “best” \(\lambda\) and check if it’s improving the fit.

bestlam=cv.out$lambda.min
bestlam
## [1] 326.0828
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 139856.6
  • It does reduce the test MSE (139856.6, down from 142199.2 at \(\lambda = 4\)).

The Coefficients

out=glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:20,]
##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
##  15.44383135   0.07715547   0.85911581   0.60103107   1.06369007   0.87936105 
##        Walks        Years       CAtBat        CHits       CHmRun        CRuns 
##   1.62444616   1.35254780   0.01134999   0.05746654   0.40680157   0.11456224 
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
##   0.12116504   0.05299202  22.09143189 -79.04032637   0.16619903   0.02941950 
##       Errors   NewLeagueN 
##  -1.36092945   9.12487767
  • As expected, none of the coefficients are zero: ridge regression does not perform any variable selection.

Lasso

Lasso Penalty

  • Recall: Ridge, Lasso, and Elastic Net are all implemented in the R package glmnet.

  • Elastic Net Penalty: \[ P_{\lambda,\alpha}(\beta) = \lambda \left \{ \frac{(1-\alpha)}{2} {\lVert \beta \rVert}_2^2 + \alpha {\lVert \beta \rVert}_1 \right \} \]

  • \(\alpha = 0\) is Ridge, \(\alpha = 1\) is Lasso, anything in between is Elastic Net.

  • All we need to do is set alpha = 1 in the glmnet() function call.

Lasso in R

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

Choosing lambda by 10-fold CV

set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

Choosing lambda by 10-fold CV

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 143673.6

This is substantially lower than the test set MSE of the null model and of least squares, and very similar to the test MSE of ridge regression with \(\lambda\) chosen by cross-validation.
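
A common alternative (a sketch, not in the original slides) is the one-standard-error rule: cv.glmnet() also reports lambda.1se, the largest \(\lambda\) whose CV error is within one standard error of the minimum, which typically yields a sparser model at a small cost in test error:

lam1se=cv.out$lambda.1se                               # one-standard-error choice
lasso.pred1=predict(lasso.mod,s=lam1se,newx=x[test,])
mean((lasso.pred1-y.test)^2)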

Sparse Model

out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,]
lasso.coef
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##    1.27479059   -0.05497143    2.18034583    0.00000000    0.00000000 
##           RBI         Walks         Years        CAtBat         CHits 
##    0.00000000    2.29192406   -0.33806109    0.00000000    0.00000000 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##    0.02825013    0.21628385    0.41712537    0.00000000   20.28615023 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -116.16755870    0.23752385    0.00000000   -0.85629148    0.00000000

Sparse Model

Here we see that 8 of the 19 coefficient estimates are exactly zero, so the lasso model with \(\lambda\) chosen by cross-validation contains only 11 of the variables.

lasso.coef[lasso.coef !=0]
##   (Intercept)         AtBat          Hits         Walks         Years 
##    1.27479059   -0.05497143    2.18034583    2.29192406   -0.33806109 
##        CHmRun         CRuns          CRBI       LeagueN     DivisionW 
##    0.02825013    0.21628385    0.41712537   20.28615023 -116.16755870 
##       PutOuts        Errors 
##    0.23752385   -0.85629148

Large \(p\), small \(n\)

  • Generate \(X\), \(y\) with \(n = 30, p = 60\) with a sparse \(\beta\) vector with 5 non-zero elements.
  • Generate \(X\) matrix as \(n \times p\) matrix with all entries from a \(N(0,1)\) distribution.
  • Take \(\beta = (3,\ldots,3, 0, \ldots, 0)\), with the first 5 entries equal to 3, and small errors \(\epsilon \sim N(0, 0.1^2)\).

Generating the data

library(Matrix)
set.seed(12345)
n = 30;p = 60
p1 = 5
beta <- c(rep(3,p1),rep(0,p-p1))
x=scale(matrix(rnorm(n*p),n,p))
eps=rnorm(n,mean=0,sd = 0.1)
fx = x %*% beta
y=drop(fx+eps)
  • Our goal is to see if Lasso can recover the 5 non-zero \(\beta\)’s correctly.

Ordinary Least Squares

  • The OLS solution will not work here, as expected.
xtx = t(x)%*%x
y1 <- t(x) %*% y 
betahat.1 <- solve(xtx,y1)
## Error in solve.default(xtx, y1): system is computationally singular: reciprocal condition number = 8.24087e-19
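
The reason (a brief sketch, not in the original): with \(p = 60 > n = 30\), the \(60 \times 60\) matrix \(X^T X\) has rank at most 30, so it is singular and cannot be inverted.

qr(xtx)$rank   # at most n = 30, far short of p = 60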

Does Pseudo-Inverse Work?

The Moore-Penrose g-inverse is unstable and does not recover the true sparsity pattern.

library(MASS)
xtx = t(x)%*%x
y1 <- t(x) %*% y 
xtx <- as.matrix(xtx)
betahat <- ginv(xtx) %*% y1 
t(betahat)
##          [,1]     [,2]     [,3]     [,4]    [,5]       [,6]      [,7]      [,8]
## [1,] 1.608659 2.055189 2.008242 1.483109 2.07467 -0.3614589 0.7423397 0.6313197
##            [,9]      [,10]      [,11]     [,12]     [,13]     [,14]      [,15]
## [1,] -0.5369256 -0.6536013 -0.4607142 0.3530865 0.2274444 0.2260297 -0.4165202
##           [,16]      [,17]     [,18]      [,19]     [,20]     [,21]       [,22]
## [1,] -0.1228928 -0.9067978 0.1028133 -0.6649684 0.1944959 0.1970185 0.006585257
##          [,23]      [,24]     [,25]     [,26]      [,27]      [,28]     [,29]
## [1,] 0.5016977 -0.1721992 0.1778885 0.3185551 -0.3519374 -0.6238105 0.1718405
##            [,30]     [,31]     [,32]      [,33]     [,34]      [,35]      [,36]
## [1,] -0.05045217 0.5132045 0.4552682 -0.2886573 0.2954796 -0.5662579 -0.1817709
##            [,37]     [,38]      [,39]     [,40]      [,41]      [,42]     [,43]
## [1,] -0.09759172 0.1950746 -0.5980697 0.4918535 0.02023623 -0.1580072 0.2578998
##          [,44]     [,45]    [,46]      [,47]       [,48]    [,49]      [,50]
## [1,] 0.4574696 -1.058922 0.126543 -0.2223721 -0.01834812 0.438076 -0.7194657
##          [,51]     [,52]      [,53]     [,54]       [,55]      [,56]      [,57]
## [1,] 0.3673409 0.6158204 -0.5537395 0.3276649 0.006263418 0.02665544 -0.2796517
##           [,58]      [,59]      [,60]
## [1,] -0.7236042 -0.1991801 -0.1380917

The g-Inverse estimate

plot(betahat,col=2,ylim=c(0,3), pch = 1)
points(beta,col= 3, pch = 2)
legend("topright",c("G-inv","True"),col=c(2,3),pch=c(1,2))

Enter Lasso

grid=10^seq(10,-2,length=100)
lasso.mod=glmnet(x,y,alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

Choosing Lambda

cv.out=cv.glmnet(x,y,alpha=1)
(bestlam=cv.out$lambda.min)
## [1] 0.08295634
plot(cv.out)

Sparse Solution

lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)
lasso.coef[lasso.coef !=0]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1]  0.01459650  2.85670247  2.92947327  2.88166521  2.85101983  2.96229890
## [7] -0.01440246
  • Use the which() function to get rid of the warning message:
lasso.coef[which(lasso.coef!=0)]
## [1]  0.01459650  2.85670247  2.92947327  2.88166521  2.85101983  2.96229890
## [7] -0.01440246

Lasso is close to the truth

plot(lasso.coef[-1],col="red", pch = 1, ylim = c(-0.5,3.5))
points(beta,col= rgb(0,0,1,0.5),pch = 16)
legend("topright",c("Lasso","True"),col=c("red","blue"),pch=c(1,16))
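
As a final check (a sketch, not in the original; coef.vec and est.support are illustrative names), compare the support selected by the lasso with the true support \(\{1, \ldots, 5\}\):

coef.vec=drop(as.matrix(lasso.coef))     # dense named coefficient vector
est.support=which(coef.vec[-1]!=0)       # nonzero positions, intercept dropped
est.support                              # true nonzero positions are 1:5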