glmnet
install.packages(c("mplot","ISLR","glmnet","MASS","readr","dplyr"))
This is a data set from the UCI Machine Learning Repository.
To make yourself familiar with the data, read an explanation.
We start by reading the data directly from the URL. Because the entries are separated by a comma, we specify sep=","
in the function read.table
or use the read_csv
function from the readr
package:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data"
dat = readr::read_csv(url,col_names = FALSE)
# View(dat)
dim(dat)
dat = dplyr::rename(dat, y = X35)
# names(dat)[35] = "y" # an alternative
dat$y = as.factor(dat$y)
Look at the summary statistics for each variable and their scatterplot matrix. Show that column V2
is a vector of zeros and should therefore be omitted from the analysis.
Delete the second column and make available a cleaned design matrix x
with \(p=33\) columns and a response vector y
of length \(n=351\), which is a factor with levels b
and g
.
x = as.matrix(dat[,-c(2,35)])
dim(x)
y = dat$y
is.factor(y)
table(y)
Check the pairwise correlations in the design matrix and show that only one pair (the 12th and 14th column of x
) of variables has an absolute Pearson correlation of more than 0.8.
Fit a logistic lasso regression and comment on the lasso coefficient plot (showing \(\log(\lambda)\) on the x-axis and showing labels for the variables).
Use cv.glmnet()
to get the two default choices lambda.min
and lambda.1se
for the lasso tuning parameter \(\lambda\) when the classification error is minimised. Calculate these two values on the log-scale as well to relate to the above lasso coefficient plot.
set.seed(1)
lasso.cv = cv.glmnet(x, y, family = "binomial", type.measure = "class")
c(lasso.cv$lambda.min,lasso.cv$lambda.1se)
round(log(c(lasso.cv$lambda.min,lasso.cv$lambda.1se)),2)
plot(lasso.cv)
Explore how sensitive the lambda.1se
value is to different choices of \(k\) in the \(k\)-fold CV (changing the default option nfolds=10
) as well as to different choices of the random seed.
Choose that \(\lambda\) value that gives smallest AIC and BIC for the models on the lasso path.
To calculate the \(-2\times\text{LogLik}\) for the models on the lasso path use the function deviance.glmnet
(check help(deviance.glmnet)
) and for calculating the number of non-zero regression parameters use the value df
of the glmnet
object.
Show also the corresponding regression coefficients.
Visualise the parameter estimates from the maximum-likelihood (ML), lasso, ridge and elastic-net methods. When obtaining the parameter estimates, use lambda.1se
for each of the three regularization methods. Recall that glm
fits logistic regression models with ML.
It is impossible to know for real data what classifier (regression model) will do best and often it is not even clear what is meant by ‘best’. We are going to use again the classification error, but this time of independent data to assess what is ‘best’ in terms of prediction.
To empirically determine this prediction error, a standard approach is to split the data into a training set (often around 2/3 of the observations) and into a validation set (the remaining observations).
Produce such a random split:
n = length(y)
n
set.seed(2015)
train = sample(1:n, 2*n/3) # by default replace=FALSE in sample()
length(train)/n
max(table(train)) # if 1, then no ties in the sampled observations
test = (1:n)[-train]
lambda.min
from cv.glmnet
.lasso.cv = cv.glmnet(x[train,], y[train], family="binomial", type.measure="class")
lasso.fit = glmnet(x[train,], y[train], family="binomial")
# calculate predicted probability to be 'g' - second level of categorical y - through type="response"
lasso.predict.prob = predict(lasso.fit, s=lasso.cv$lambda.min, newx=x[test,], type="response") # check help(predict.glmnet)
predict.g = lasso.predict.prob > 0.5
table(y[test],predict.g)
sum(table(y[test],predict.g)[c(2,3)])/length(test)
# calculate predicted y through type="class"
lasso.predict.y = predict(lasso.fit, s=lasso.cv$lambda.min,
newx=x[test,], type="class") # check help(predict.glmnet)
mean(lasso.predict.y!=y[test])
To explore the performance of the above regularization methods in normal linear regression when \(1000 = p > n = 300\) let us consider a true model \[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \quad\text{with}\quad \boldsymbol{\epsilon} \sim N(\boldsymbol{0},\boldsymbol{I}) \]
Generate the regression data
set.seed(300)
n = 300
p = 1000
p.nonzero = 10
x = matrix(rnorm(n*p),nrow=n,ncol=p)
b.nonzero = rexp(p.nonzero)*(sign(rnorm(p.nonzero)))
b.nonzero
beta = c(b.nonzero,rep(0,p-p.nonzero))
y = x%*%beta + rnorm(n)
Split the data into a 2/3 training and 1/3 test set as before.
train = sample(1:n,2*n/3)
x.train = x[train,]
x.test = x[-train,]
y.train = y[train]
y.test = y[-train]
lasso.fit = glmnet(x.train, y.train, family="gaussian", alpha=1)
ridge.fit = glmnet(x.train, y.train, family="gaussian", alpha=0)
enet.fit = glmnet(x.train, y.train, family="gaussian", alpha=.5)
mse
(mean squared error) from cv.glmnet
for 10-fold CV. Plot the solution paths and cross-validated MSE as function of \(\lambda\).fit = list()
for (i in 0:10) {
fit[[i+1]] = cv.glmnet(x.train, y.train, type.measure="mse", alpha=i/10,family="gaussian")
}
names(fit) = 0:10
par(mfrow=c(3,2))
plot(lasso.fit, xvar="lambda")
plot(fit[["10"]], main="lasso")
plot(ridge.fit, xvar="lambda")
plot(fit[["5"]], main="elastic-net")
plot(enet.fit, xvar="lambda")
plot(fit[["0"]], main="ridge")
yhat0 = predict(fit[["0"]], s=fit[["0"]]$lambda.1se, newx=x.test)
yhat1 = predict(fit[["1"]], s=fit[["1"]]$lambda.1se, newx=x.test)
yhat2 = predict(fit[["2"]], s=fit[["2"]]$lambda.1se, newx=x.test)
yhat3 = predict(fit[["3"]], s=fit[["3"]]$lambda.1se, newx=x.test)
yhat4 = predict(fit[["4"]], s=fit[["4"]]$lambda.1se, newx=x.test)
yhat5 = predict(fit[["5"]], s=fit[["5"]]$lambda.1se, newx=x.test)
yhat6 = predict(fit[["6"]], s=fit[["6"]]$lambda.1se, newx=x.test)
yhat7 = predict(fit[["7"]], s=fit[["7"]]$lambda.1se, newx=x.test)
yhat8 = predict(fit[["8"]], s=fit[["8"]]$lambda.1se, newx=x.test)
yhat9 = predict(fit[["9"]], s=fit[["9"]]$lambda.1se, newx=x.test)
yhat10 = predict(fit[["10"]], s=fit[["10"]]$lambda.1se, newx=x.test)
mse0 = mean((y.test - yhat0)^2)
mse1 = mean((y.test - yhat1)^2)
mse2 = mean((y.test - yhat2)^2)
mse3 = mean((y.test - yhat3)^2)
mse4 = mean((y.test - yhat4)^2)
mse5 = mean((y.test - yhat5)^2)
mse6 = mean((y.test - yhat6)^2)
mse7 = mean((y.test - yhat7)^2)
mse8 = mean((y.test - yhat8)^2)
mse9 = mean((y.test - yhat9)^2)
mse10 = mean((y.test - yhat10)^2)
c(mse0,mse1,mse2,mse3,mse4,mse5,mse6,mse7,mse8,mse9,mse10)
Here, lasso has smallest MSE and we note that lasso seems to do well when there are lots of redundant variables.
Generate the regression data
set.seed(300)
n = 300
p = 1000
p.nonzero = n
x = matrix(rnorm(n*p),nrow=n,ncol=p)
b.nonzero = rexp(p.nonzero)*(sign(rnorm(p.nonzero)))
summary(b.nonzero)
beta = c(b.nonzero,rep(0,p-p.nonzero))
y = x%*%beta + rnorm(n)
Split the data into a 2/3 training and 1/3 test set as before.
train = sample(1:n,2*n/3)
x.train = x[train,]
x.test = x[-train,]
y.train = y[train]
y.test = y[-train]
Fit the lasso, elastic-net (with \(\alpha=0.5\)) and ridge regression.
Write a loop, varying \(\alpha\) from \(0,0.1,\ldots 1\) and extract mse
(mean squared error) from cv.glmnet
for 10-fold CV.
Plot the solution paths and cross-validated MSE as function of \(\lambda\).
Calculate the MSE on the test set.
Here, elastic-net has smallest MSE but we note that the MSE is much more variable (as \(\alpha\) is varied).
Investigate effect of correlated columns in \(\boldsymbol{X}\), for example \(\text{Cov}(\boldsymbol{X})_{ij}=(0.9)^{|i−j|}\).
n = 100
p = 60
CovMatrix = outer(1:p, 1:p, function(x,y) {.9^abs(x-y)})
x = MASS::mvrnorm(n, rep(0,p), CovMatrix)
Similar setupas above but different ways to generate non-zero regression coefficients