Learning outcomes

  1. Regularisation with glmnet
  2. Lasso regression
  3. Ridge regression
  4. Elastic-net
  5. Cross-validation for model selection
  6. Selecting \(\lambda\)

Required R packages:

install.packages(c("mplot","ISLR","glmnet","MASS","readr","dplyr"))

Ionosphere data

This is a data set from the UCI Machine Learning Repository.

Get to know and clean the data

To familiarise yourself with the data, read the description of the variables on the UCI Machine Learning Repository page.

We start by reading the data directly from the URL. Because the entries are comma-separated, we either specify sep="," in read.table() or use the read_csv() function from the readr package:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data"
dat = readr::read_csv(url,col_names = FALSE)
# View(dat)
dim(dat)
dat = dplyr::rename(dat, y = X35)
# names(dat)[35] = "y" # an alternative
dat$y = as.factor(dat$y)

Look at the summary statistics for each variable and their scatterplot matrix. Show that the second column (X2) is a vector of zeros and should therefore be omitted from the analysis.
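A minimal check (a sketch; read_csv with col_names = FALSE names the columns X1, ..., X35, so the suspect predictor is X2):

summary(dat$X2)    # min = max = 0
table(dat$X2)      # all 351 entries are zero
# pairs(dat[, 1:5])  # scatterplot matrix of the first few predictors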

Delete the second column and make available a cleaned design matrix x with \(p=33\) columns and a response vector y of length \(n=351\), which is a factor with levels b and g.

x = as.matrix(dat[,-c(2,35)])
dim(x)
y = dat$y
is.factor(y)
table(y)

Check the pairwise correlations in the design matrix and show that only one pair (the 12th and 14th column of x) of variables has an absolute Pearson correlation of more than 0.8.
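One way to check this (a sketch based on cor()):

cors = cor(x)
diag(cors) = 0                           # ignore the trivial diagonal
which(abs(cors) > 0.8, arr.ind = TRUE)   # should flag only the pair of columns 12 and 14
max(abs(cors))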

Logistic lasso regression

Fit a logistic lasso regression and comment on the lasso coefficient plot (with \(\log(\lambda)\) on the x-axis and labels for the variables).
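A minimal sketch, assuming the glmnet package is loaded and x and y are the cleaned design matrix and response from above:

library(glmnet)
# fit the whole lasso path for logistic regression
lasso.fit = glmnet(x, y, family = "binomial")
# coefficient profiles against log(lambda), with variable labels
plot(lasso.fit, xvar = "lambda", label = TRUE)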

Choice of \(\lambda\) through CV

Use cv.glmnet() to obtain the two default choices lambda.min and lambda.1se for the lasso tuning parameter \(\lambda\), using the classification error as the CV criterion. Also calculate these two values on the log scale to relate them to the lasso coefficient plot above.

set.seed(1)
lasso.cv = cv.glmnet(x, y, family = "binomial", type.measure = "class")
c(lasso.cv$lambda.min,lasso.cv$lambda.1se)
round(log(c(lasso.cv$lambda.min,lasso.cv$lambda.1se)),2)
plot(lasso.cv)

Explore how sensitive the lambda.1se value is to different choices of \(k\) in the \(k\)-fold CV (changing the default option nfolds=10) as well as to different choices of the random seed.
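For example, a sketch that varies both nfolds and the random seed:

for (k in c(5, 10, 20)) {
  for (s in 1:3) {
    set.seed(s)
    cv = cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = k)
    cat("nfolds =", k, " seed =", s,
        " log(lambda.1se) =", round(log(cv$lambda.1se), 2), "\n")
  }
}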

Choice of \(\lambda\) through AIC and BIC

Choose the \(\lambda\) value that gives the smallest AIC and BIC among the models on the lasso path.

To calculate \(-2\times\text{LogLik}\) for the models on the lasso path, use the function deviance.glmnet (check help(deviance.glmnet)); for the number of non-zero regression parameters, use the df component of the glmnet object.

Show also the corresponding regression coefficients.
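A sketch of this calculation (for the binomial family with a binary response, the deviance returned by deviance.glmnet equals \(-2\times\text{LogLik}\)):

lasso.fit = glmnet(x, y, family = "binomial")
dev = deviance(lasso.fit)   # -2 * LogLik for every model on the path
k   = lasso.fit$df          # number of non-zero regression parameters
aic = dev + 2 * k
bic = dev + log(nrow(x)) * k
lambda.aic = lasso.fit$lambda[which.min(aic)]
lambda.bic = lasso.fit$lambda[which.min(bic)]
round(log(c(lambda.aic, lambda.bic)), 2)
# corresponding regression coefficients
coef(lasso.fit, s = lambda.aic)
coef(lasso.fit, s = lambda.bic)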

Maximum-likelihood, ridge, lasso and elastic-net

Visualise the parameter estimates from the maximum-likelihood (ML), lasso, ridge and elastic-net methods. When obtaining the parameter estimates, use lambda.1se for each of the three regularization methods. Recall that glm fits logistic regression models with ML.
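One possible sketch, reusing lasso.cv from above and fitting the corresponding ridge and elastic-net CV objects (the ML fit may produce warnings if the classes are close to separable):

ml.fit   = glm(y ~ x, family = binomial)
ridge.cv = cv.glmnet(x, y, family = "binomial", alpha = 0)
enet.cv  = cv.glmnet(x, y, family = "binomial", alpha = 0.5)

betas = cbind(ML    = coef(ml.fit),
              Lasso = as.vector(coef(lasso.cv, s = "lambda.1se")),
              Ridge = as.vector(coef(ridge.cv, s = "lambda.1se")),
              Enet  = as.vector(coef(enet.cv,  s = "lambda.1se")))
matplot(betas, type = "b", pch = 1:4, col = 1:4,
        xlab = "coefficient index", ylab = "estimate")
legend("topright", legend = colnames(betas), pch = 1:4, col = 1:4)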

Which is better, lasso, ridge or elastic-net?

For real data it is impossible to know which classifier (regression model) will do best, and often it is not even clear what is meant by ‘best’. We will again use the classification error, but this time computed on independent data, to assess which method is ‘best’ in terms of prediction.

To estimate this prediction error empirically, a standard approach is to split the data into a training set (often around 2/3 of the observations) and a validation set (the remaining observations).

Produce such a random split:

n = length(y)
n
set.seed(2015)
train = sample(1:n, 2*n/3) # by default replace=FALSE in sample()
length(train)/n
max(table(train)) # if 1, then no ties in the sampled observations
test = (1:n)[-train]
  • Calculate the validation classification error rate for the lasso when the logistic regression model is based on the training sample and \(\lambda\) is selected through lambda.min from cv.glmnet.
lasso.cv = cv.glmnet(x[train,], y[train], family="binomial", type.measure="class")
lasso.fit = glmnet(x[train,], y[train], family="binomial")
# calculate predicted probability to be 'g' - second level of categorical y - through type="response"
lasso.predict.prob = predict(lasso.fit, s=lasso.cv$lambda.min, newx=x[test,], type="response") # check help(predict.glmnet) 
predict.g = lasso.predict.prob > 0.5
table(y[test],predict.g)
sum(table(y[test],predict.g)[c(2,3)])/length(test)
# calculate predicted y through type="class"
lasso.predict.y = predict(lasso.fit, s=lasso.cv$lambda.min,
newx=x[test,], type="class") # check help(predict.glmnet) 
mean(lasso.predict.y!=y[test])
  • Calculate the validation classification error rate for the elastic-net (with \(\alpha=0.5\)) and ridge regression as well.
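A sketch that mirrors the lasso steps above, with alpha = 0.5 for the elastic-net and alpha = 0 for ridge:

# elastic-net
enet.cv  = cv.glmnet(x[train,], y[train], family = "binomial",
                     type.measure = "class", alpha = 0.5)
enet.fit = glmnet(x[train,], y[train], family = "binomial", alpha = 0.5)
enet.predict.y = predict(enet.fit, s = enet.cv$lambda.min,
                         newx = x[test,], type = "class")
mean(enet.predict.y != y[test])

# ridge
ridge.cv  = cv.glmnet(x[train,], y[train], family = "binomial",
                      type.measure = "class", alpha = 0)
ridge.fit = glmnet(x[train,], y[train], family = "binomial", alpha = 0)
ridge.predict.y = predict(ridge.fit, s = ridge.cv$lambda.min,
                          newx = x[test,], type = "class")
mean(ridge.predict.y != y[test])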

Simulation study

To explore the performance of the above regularization methods in normal linear regression when \(p = 1000 > n = 300\), let us consider the true model \[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad\text{with}\quad \boldsymbol{\epsilon} \sim N(\boldsymbol{0},\boldsymbol{I}) \]

Setting 1

  • Sparse signal, 1% non-zero regression parameters
  • Predictors have small empirical correlations

Generate the regression data

set.seed(300)
n = 300
p = 1000
p.nonzero = 10
x = matrix(rnorm(n*p),nrow=n,ncol=p)
b.nonzero = rexp(p.nonzero)*(sign(rnorm(p.nonzero)))
b.nonzero
beta = c(b.nonzero,rep(0,p-p.nonzero))
y = x%*%beta + rnorm(n)

Split the data into a 2/3 training and 1/3 test set as before.

train = sample(1:n,2*n/3)
x.train = x[train,]
x.test  = x[-train,]
y.train = y[train]
y.test  = y[-train]
  • Fit the lasso, elastic-net (with \(\alpha=0.5\)) and ridge regression.
lasso.fit = glmnet(x.train, y.train, family="gaussian", alpha=1)
ridge.fit = glmnet(x.train, y.train, family="gaussian", alpha=0)
enet.fit  = glmnet(x.train, y.train, family="gaussian", alpha=.5)
  • Write a loop, varying \(\alpha\) over \(0, 0.1, \ldots, 1\), and extract the mse (mean squared error) from cv.glmnet for 10-fold CV. Plot the solution paths and the cross-validated MSE as a function of \(\lambda\).
fit = list()
for (i in 0:10) {
    fit[[i+1]] = cv.glmnet(x.train, y.train, type.measure="mse", alpha=i/10,family="gaussian")
}
names(fit) = 0:10

par(mfrow=c(3,2))
plot(lasso.fit, xvar="lambda")
plot(fit[["10"]], main="lasso")

plot(enet.fit, xvar="lambda")
plot(fit[["5"]], main="elastic-net")

plot(ridge.fit, xvar="lambda")
plot(fit[["0"]], main="ridge")
  • Calculate the MSE on the test set.
yhat0  = predict(fit[["0"]], s=fit[["0"]]$lambda.1se, newx=x.test)
yhat1  = predict(fit[["1"]], s=fit[["1"]]$lambda.1se, newx=x.test)
yhat2  = predict(fit[["2"]], s=fit[["2"]]$lambda.1se, newx=x.test)
yhat3  = predict(fit[["3"]], s=fit[["3"]]$lambda.1se, newx=x.test)
yhat4  = predict(fit[["4"]], s=fit[["4"]]$lambda.1se, newx=x.test)
yhat5  = predict(fit[["5"]], s=fit[["5"]]$lambda.1se, newx=x.test)
yhat6  = predict(fit[["6"]], s=fit[["6"]]$lambda.1se, newx=x.test)
yhat7  = predict(fit[["7"]], s=fit[["7"]]$lambda.1se, newx=x.test)
yhat8  = predict(fit[["8"]], s=fit[["8"]]$lambda.1se, newx=x.test)
yhat9  = predict(fit[["9"]], s=fit[["9"]]$lambda.1se, newx=x.test)
yhat10 = predict(fit[["10"]], s=fit[["10"]]$lambda.1se, newx=x.test)

mse0  = mean((y.test - yhat0)^2)
mse1  = mean((y.test - yhat1)^2)
mse2  = mean((y.test - yhat2)^2)
mse3  = mean((y.test - yhat3)^2)
mse4  = mean((y.test - yhat4)^2)
mse5  = mean((y.test - yhat5)^2)
mse6  = mean((y.test - yhat6)^2)
mse7  = mean((y.test - yhat7)^2)
mse8  = mean((y.test - yhat8)^2)
mse9  = mean((y.test - yhat9)^2)
mse10 = mean((y.test - yhat10)^2)
c(mse0,mse1,mse2,mse3,mse4,mse5,mse6,mse7,mse8,mse9,mse10)

Here, the lasso has the smallest test MSE; the lasso tends to do well when there are many redundant variables.

Setting 2

  • Non-sparse signal, \(n/p = 30\%\) non-zero regression parameters (i.e. 300 non-zero coefficients)
  • Predictors have small empirical correlations

Generate the regression data

set.seed(300)
n = 300
p = 1000
p.nonzero = n
x = matrix(rnorm(n*p),nrow=n,ncol=p)
b.nonzero = rexp(p.nonzero)*(sign(rnorm(p.nonzero)))
summary(b.nonzero)
beta = c(b.nonzero,rep(0,p-p.nonzero))
y = x%*%beta + rnorm(n)

Split the data into a 2/3 training and 1/3 test set as before.

train = sample(1:n,2*n/3)
x.train = x[train,]
x.test  = x[-train,]
y.train = y[train]
y.test  = y[-train]
  • Fit the lasso, elastic-net (with \(\alpha=0.5\)) and ridge regression.

  • Write a loop, varying \(\alpha\) over \(0, 0.1, \ldots, 1\), and extract the mse (mean squared error) from cv.glmnet for 10-fold CV.

  • Plot the solution paths and the cross-validated MSE as a function of \(\lambda\).

  • Calculate the MSE on the test set (a compact sketch combining these steps follows below).
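A compact sketch of the Setting 2 workflow, reusing the code structure from Setting 1 (it skips the solution-path plots):

fit = list()
for (i in 0:10) {
  fit[[i+1]] = cv.glmnet(x.train, y.train, type.measure = "mse",
                         alpha = i/10, family = "gaussian")
}
names(fit) = 0:10

# test-set MSE for each alpha, using the 1-SE rule
mse = sapply(as.character(0:10), function(a)
  mean((y.test - predict(fit[[a]], newx = x.test, s = "lambda.1se"))^2))
round(mse, 2)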

Here, the elastic-net has the smallest test MSE, but we note that the MSE varies much more across the different values of \(\alpha\) than in Setting 1.

Additional settings

Investigate the effect of correlated columns in \(\boldsymbol{X}\), for example \(\text{Cov}(\boldsymbol{X})_{ij} = 0.9^{|i-j|}\).

n = 100
p = 60  
CovMatrix = outer(1:p, 1:p, function(x,y) {.9^abs(x-y)})
x = MASS::mvrnorm(n, rep(0,p), CovMatrix)

Use a similar setup as above, but with different ways to generate the non-zero regression coefficients:

  • All are equal to 1
  • Some are large (say 4), intermediate (1) and small (0.25)
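A sketch of the two coefficient patterns, assuming x, n and p from the block above and a hypothetical choice of 30 non-zero coefficients:

p.nonzero = 30   # assumed number of non-zero coefficients; adjust as desired
# (a) all non-zero coefficients equal to 1
beta.a = c(rep(1, p.nonzero), rep(0, p - p.nonzero))
# (b) a mix of large (4), intermediate (1) and small (0.25) effects
beta.b = c(rep(4, 10), rep(1, 10), rep(0.25, 10), rep(0, p - 30))
y.a = x %*% beta.a + rnorm(n)
y.b = x %*% beta.b + rnorm(n)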