install.packages(c("mplot", "lmSubsets", "bestglm", "car", "ISLR", "glmnet", "MASS", "readr", "dplyr", "Hmisc", "skimr", "mfp", "hierNet", "RAMP"))
Efron et al. (2004) introduced the diabetes data set with 442 observations and 11 variables. It is often used as an exemplar data set to illustrate new model selection techniques. The following commands will help you get a feel for the data.
data("diabetes", package="mplot")
# help("diabetes", package="mplot")
str(diabetes) # structure of the diabetes data
pairs(diabetes) # traditional pairs plot
boxplot(diabetes) # always a good idea to check for gross outliers
cor(diabetes)
Hmisc::describe(diabetes, digits=1) # summary of the diabetes data
skimr::skim(diabetes) # alternative summary method
Remark: Hmisc::describe returns Gmd, which is Gini’s mean difference, a robust measure of dispersion equal to the mean absolute difference between all pairs of observations.
Try to find a couple of good models using the stepwise techniques discussed in lectures.
M0 = lm(y ~ 1, data = diabetes) # Null model
M1 = lm(y ~ ., data = diabetes) # Full model
summary(M1)
dim(diabetes)
Try doing backward selection using AIC first, then using BIC. Do you get the same result? Hint: specify the options direction and k in the step() function correctly.
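For example, one possible sketch, assuming the M0 and M1 models defined above (the default k = 2 gives AIC; k = log(n) gives BIC):
n = nrow(diabetes)
step.aic = step(M1, direction = "backward", trace = FALSE)              # backward selection with AIC
step.bic = step(M1, direction = "backward", k = log(n), trace = FALSE)  # backward selection with BIC
summary(step.aic)
summary(step.bic)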
Explore the forward selection technique, which works very similarly to backward selection: just set direction = "forward" in the step() function. When using direction = "forward" you need to specify a scope parameter: scope = list(lower = M0, upper = M1).
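For example, a sketch using the null and full models from above:
step.fwd = step(M0, scope = list(lower = M0, upper = M1), direction = "forward", trace = FALSE)
summary(step.fwd)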
The add1() and drop1() commands: try using the add1() function. The general form is add1(fitted.model, test = "F", scope = M1). There is also the drop1() function; what does it return?
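For example, a sketch assuming M0 and M1 from above:
add1(M0, test = "F", scope = M1)  # F-tests for adding each variable to the null model
drop1(M1, test = "F")             # F-tests for dropping each variable from the full model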
Your turn: explore the body fat data set. Using the techniques outlined in the first lecture, identify one (or more) models that help explain the body fat percentage of an individual.
data("bodyfat", package = "mplot")
# help("bodyfat", package = "mplot")
dim(bodyfat)
names(bodyfat)
bfat = bodyfat[,-1] # remove the ID column
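A possible starting point (a sketch; it assumes the response column in bfat is named Bodyfat, so check names(bfat) first):
M0.bf = lm(Bodyfat ~ 1, data = bfat)  # null model (assumes response column is Bodyfat)
M1.bf = lm(Bodyfat ~ ., data = bfat)  # full model
step(M1.bf, direction = "backward", k = log(nrow(bfat)), trace = FALSE)  # backward selection with BIC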
data("UScrime", package = "MASS")
n = nrow(UScrime)
M0 = lm(y ~ 1, data=UScrime) # Null model
M1 = lm(y ~ ., data=UScrime) # Full model
Use the lmSubsets() function from the lmSubsets package to obtain the 50 best models of each size, and use the plot() function to plot their RSS (and BIC) against model dimension.
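For example, a sketch (assuming lmSubsets() accepts an nbest argument and that its plot() method shows RSS and BIC against model size):
library(lmSubsets)
rss.all = lmSubsets(y ~ ., data = UScrime, nbest = 50)  # 50 best submodels per size
plot(rss.all)                                           # RSS (and BIC) versus model dimension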
Obtain and compare the best AIC and BIC models using the lmSelect() function and the forward and backward stepwise procedures.
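One possible comparison (a sketch; it assumes lmSelect() takes penalty = "AIC" or "BIC" and that refit() converts the result to an ordinary lm object):
best.aic = lmSelect(y ~ ., data = UScrime, penalty = "AIC")
best.bic = lmSelect(y ~ ., data = UScrime, penalty = "BIC")
variable.names(refit(best.aic))   # variables in the best AIC model
variable.names(refit(best.bic))   # variables in the best BIC model
fwd.aic = step(M0, scope = list(lower = M0, upper = M1), direction = "forward", trace = FALSE)
bwd.aic = step(M1, direction = "backward", trace = FALSE)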
This is the ionosphere data set from the UCI Machine Learning Repository. To make yourself familiar with the data, read the description on the repository page.
We start by reading the data directly from the URL. There are many ways to read a file into R: for example, read.table() with sep = ",", read.csv(), or the read_csv() function from the readr package, which is faster than read.csv().
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data"
dat = readr::read_csv(url, col_names = FALSE)
# install.packages("vroom")
# dat = vroom::vroom(url, col_names = FALSE)
# View(dat)
dim(dat)
dat = dplyr::rename(dat, y = X35)
# names(dat)[35] = "y" # an alternative
dat$y = as.factor(dat$y)
Look at the summary statistics for each variable and their scatterplot matrix. Show that column X2 is a vector of zeros and should therefore be omitted from the analysis. Delete the second column and make available a cleaned design matrix x with \(p=33\) columns and a response vector y of length \(n=351\), which is a factor with levels b and g.
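For example, a quick check of the X2 column (a sketch):
summary(dat$X2)     # every value is zero
all(dat$X2 == 0)    # should return TRUE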
x = as.matrix(dat[,-c(2,35)])
dim(x)
y = dat$y
is.factor(y)
table(y)
Check the pairwise correlations in the design matrix and show that only one pair of variables (the 12th and 14th columns of x) has an absolute Pearson correlation of more than 0.8.
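One way to check this (a sketch):
cors = cor(x)
high = which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
high           # row/column indices of the highly correlated pair
cors[high]     # the corresponding correlation value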
Fit a logistic lasso regression and comment on the lasso coefficient plot (showing \(\log(\lambda)\) on the x-axis and showing labels for the variables).
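For example, a sketch of the fit and coefficient plot:
library(glmnet)
lasso.fit = glmnet(x, y, family = "binomial")
plot(lasso.fit, xvar = "lambda", label = TRUE)  # log(lambda) on the x-axis, variable labels shown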
Choose the \(\lambda\) value that gives the smallest AIC and BIC for the models on the lasso path. To calculate \(-2\times\text{LogLik}\) for the models on the lasso path, use the function deviance.glmnet (check help(deviance.glmnet)), and to calculate the number of non-zero regression parameters, use the df value of the glmnet object. Also show the corresponding regression coefficients.
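One possible approach, a sketch assuming the lasso.fit object from above (deviance() returns the deviance along the path, which differs from \(-2\times\text{LogLik}\) only by a constant that cancels when comparing models):
dev = deviance(lasso.fit)                 # deviance for each model on the path
aic = dev + 2 * lasso.fit$df              # AIC-type criterion
bic = dev + log(nrow(x)) * lasso.fit$df   # BIC-type criterion
lambda.aic = lasso.fit$lambda[which.min(aic)]
lambda.bic = lasso.fit$lambda[which.min(bic)]
round(log(c(lambda.aic, lambda.bic)), 2)
coef(lasso.fit, s = lambda.aic)           # coefficients of the AIC-best model
coef(lasso.fit, s = lambda.bic)           # coefficients of the BIC-best model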
Use cv.glmnet() to get the two default choices lambda.min and lambda.1se for the lasso tuning parameter \(\lambda\) when the classification error is minimised. Calculate these two values on the log scale as well, to relate them to the lasso coefficient plot above.
set.seed(1)
lasso.cv = cv.glmnet(x, y, family = "binomial", type.measure = "class")
c(lasso.cv$lambda.min,lasso.cv$lambda.1se)
round(log(c(lasso.cv$lambda.min,lasso.cv$lambda.1se)),2)
plot(lasso.cv)
Explore how sensitive the lambda.1se value is to different choices of \(k\) in the \(k\)-fold CV (changing the default option nfolds=10) as well as to different choices of the random seed. (For each lambda value, the SE is estimated from the prediction errors over the \(k\) CV folds. The 1SE lambda value is the largest tuning parameter whose estimated mean prediction error lies within one SE of that of the lambda value that minimised the prediction error.)
In the lecture we briefly discussed the hierNet (pronounced “hair net”) package. Other packages implement similar algorithms, but they tend not to be particularly user friendly. Here we will explore RAMP and hierNet.
Using the same example from the lecture, we can compare the RAMP and hierNet results.
data("wallabies", package = "mplot")
X = subset(wallabies, select = c(-rw,-lat,-long))
y = wallabies$rw
Xy = as.data.frame(cbind(X, y))
Xm = X[, 1:5] # main effects only
Xm = scale(Xm, center = TRUE, scale = TRUE)
# hierNet (from lectures)
library(hierNet)
fit = hierNet.logistic.path(Xm, y, diagonal = FALSE, strong = TRUE)
set.seed(9)
fitcv=hierNet.cv(fit,Xm,y,trace=0)
fit.1se = hierNet.logistic(Xm, y, lam=fitcv$lamhat.1se,
diagonal = FALSE, strong = TRUE)
print(fit.1se)
# RAMP (equivalent RAMP code)
library(RAMP)
ramp = RAMP(Xm, y, family = "binomial", hier = "Strong")
ramp
Your turn: Returning to the diabetes example, use both the RAMP and the hierNet package to determine a good model from the set of variables that includes all pairwise interactions.
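A possible starting point (a sketch; diabetes has a continuous response, so the gaussian versions hierNet.path()/hierNet() and RAMP with family = "gaussian" are used):
library(hierNet)
library(RAMP)
Xd = scale(as.matrix(subset(diabetes, select = -y)))  # standardised main effects
yd = diabetes$y
fit.d = hierNet.path(Xd, yd, diagonal = FALSE, strong = TRUE)
set.seed(1)
cv.d = hierNet.cv(fit.d, Xd, yd, trace = 0)
print(hierNet(Xd, yd, lam = cv.d$lamhat.1se, diagonal = FALSE, strong = TRUE))
ramp.d = RAMP(Xd, yd, family = "gaussian", hier = "Strong")
ramp.d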