Learning outcomes

  1. Stepwise methods
  2. Exhaustive searches
  3. Regularisation procedures
  4. Marginality

Required R packages

install.packages(c("mplot", "lmSubsets", "bestglm", "car", "ISLR", "glmnet", "MASS", "readr", "dplyr", "Hmisc", "skimr", "mfp"))

Stepwise methods

Diabetes

Efron et al. (2004) introduced the diabetes data set with 442 observations and 11 variables. It is often used as an exemplar data set to illustrate new model selection techniques. The following commands will help you get a feel for the data.

data("diabetes", package="mplot")
# help("diabetes", package="mplot")
str(diabetes) # structure of the diabetes data
pairs(diabetes) # traditional pairs plot
boxplot(diabetes) # always a good idea to check for gross outliers
cor(diabetes)
Hmisc::describe(diabetes, digits=1) # summary of the diabetes data
skimr::skim(diabetes) # alternative summary method

Remark: Hmisc::describe returns Gmd, which is Gini’s mean difference, a robust measure of dispersion: the mean absolute difference between all pairs of observations.

Try to find a couple of good models using the stepwise techniques discussed in lectures.

Backwards

M0 = lm(y ~ 1, data = diabetes)  # Null model
M1 = lm(y ~ ., data = diabetes)  # Full model
summary(M1)
dim(diabetes)

Try doing backward selection using AIC first, then using BIC. Do you get the same result?

Hint: Specify the option direction and k in the function step() correctly.
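As a sketch (assuming the M0 and M1 fits from above), backward selection with AIC uses the default penalty k = 2, while BIC replaces it with k = log(n):

```r
n = nrow(diabetes)
# Backward selection starting from the full model
step.aic = step(M1, direction = "backward", trace = FALSE)            # AIC (k = 2 is the default)
step.bic = step(M1, direction = "backward", k = log(n), trace = FALSE) # BIC penalty
formula(step.aic)
formula(step.bic)
```

Set trace = TRUE (the default) if you want to see each elimination step printed.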

Forwards

Explore the forward selection technique, which works very similarly to backward selection: set direction = "forward" in the step() function. When using direction = "forward" you need to specify a scope parameter: scope = list(lower = M0, upper = M1).
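A minimal sketch, again assuming the M0 and M1 objects defined earlier:

```r
# Forward selection from the null model; scope tells step()
# the largest model it is allowed to grow towards
step.fwd.aic = step(M0, direction = "forward",
                    scope = list(lower = M0, upper = M1),
                    trace = FALSE)
formula(step.fwd.aic)
```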

The add1() and drop1() commands

Try using the add1() function. The general form is add1(fitted.model, test = "F", scope = M1).

There is also the drop1() function, what does it return?
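For example, applied to the diabetes fits from above (add1() tests each candidate addition to the null model; drop1() tests each candidate deletion from the full model):

```r
add1(M0, test = "F", scope = M1)  # F-tests for adding each variable to M0
drop1(M1, test = "F")             # F-tests for dropping each variable from M1
```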

Body fat

Your turn: explore the body fat data set. Using the techniques outlined in the first lecture, identify one (or more) models that help explain the body fat percentage of an individual.

data("bodyfat", package = "mplot")
# help("bodyfat", package = "mplot")
dim(bodyfat)
names(bodyfat)
bfat = bodyfat[,-1] # remove the ID column

Exhaustive searches

Crime data

data("UScrime", package = "MASS")
n = nrow(UScrime)
M0 = lm(y ~ 1, data=UScrime) # Null model
M1 = lm(y ~ ., data=UScrime) # Full model

  • Use the lmSubsets() function from the lmSubsets package to obtain the 50 best models for each model size and use the plot() function to plot their RSS (and BIC) against model dimension.

  • Obtain the best AIC and BIC models using the lmSelect() function and compare them with the models found by forward and backward stepwise procedures.
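One possible approach is sketched below; the exact arguments are worth checking against help("lmSubsets") and help("lmSelect"):

```r
library(lmSubsets)
# Exhaustive search keeping the 50 best models of each size
exhaustive = lmSubsets(y ~ ., data = UScrime, nbest = 50)
plot(exhaustive)  # RSS and BIC against model dimension
# Best single model under each criterion
best.aic = lmSelect(y ~ ., data = UScrime, penalty = "AIC")
best.bic = lmSelect(y ~ ., data = UScrime, penalty = "BIC")
```

The selected models can then be compared against the formulas returned by step().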

Regularisation procedures

Ionosphere data

This is a data set from the UCI Machine Learning Repository. To make yourself familiar with the data, read the description on the data set's repository page.

We start by reading the data directly from the URL. There are many ways to read a file into R. For example:

  • Because the entries are separated by a comma, we specify sep="," in the function read.table()
  • Use read.csv()
  • Use the read_csv function from the readr package, which is faster than read.csv()
  • Use the new vroom package, which automatically detects the separator (e.g. tab or comma) and is blindingly fast (vroom vroom).

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data"
dat = readr::read_csv(url, col_names = FALSE)
# install.packages("vroom")
# dat = vroom::vroom(url, col_names = FALSE)
# View(dat)
dim(dat)
dat = dplyr::rename(dat, y = X35)
# names(dat)[35] = "y" # an alternative
dat$y = as.factor(dat$y)

Look at the summary statistics for each variable and their scatterplot matrix. Show that column X2 is a vector of zeros and should therefore be omitted from the analysis.
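One quick check (assuming read_csv auto-named the columns X1 to X35, as above):

```r
table(dat$X2)    # all observations fall in a single cell
all(dat$X2 == 0) # X2 is constant, so it carries no information
```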

Delete the second column and make available a cleaned design matrix x with \(p=33\) columns and a response vector y of length \(n=351\), which is a factor with levels b and g.

x = as.matrix(dat[,-c(2,35)])
dim(x)
y = dat$y
is.factor(y)
table(y)

Check the pairwise correlations in the design matrix and show that only one pair of variables (the 12th and 14th columns of x) has an absolute Pearson correlation of more than 0.8.
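One way to locate such pairs, assuming the cleaned matrix x from above:

```r
cc = cor(x)
# indices of distinct pairs with absolute correlation above 0.8
# (upper.tri() excludes the diagonal and duplicate pairs)
which(abs(cc) > 0.8 & upper.tri(cc), arr.ind = TRUE)
```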

Logistic lasso regression

Fit a logistic lasso regression and comment on the lasso coefficient plot (showing \(\log(\lambda)\) on the x-axis and showing labels for the variables).
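A minimal sketch using glmnet, assuming the x and y objects constructed above:

```r
library(glmnet)
lasso.fit = glmnet(x, y, family = "binomial")
# coefficient paths with log(lambda) on the x-axis and variable labels
plot(lasso.fit, xvar = "lambda", label = TRUE)
```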

Choice of \(\lambda\) through AIC and BIC

Choose the \(\lambda\) value that gives the smallest AIC and the \(\lambda\) value that gives the smallest BIC among the models on the lasso path.

To calculate \(-2\times\text{LogLik}\) for the models on the lasso path, use the function deviance.glmnet (check help(deviance.glmnet)); for the number of non-zero regression parameters, use the df component of the glmnet object.

Show also the corresponding regression coefficients.
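A sketch of one approach. For a binomial model the saturated log-likelihood is zero, so here the deviance reported by deviance.glmnet equals \(-2\times\text{LogLik}\):

```r
lasso.fit = glmnet(x, y, family = "binomial")
dev = deviance(lasso.fit)  # -2 log-likelihood for each model on the path
df = lasso.fit$df          # number of non-zero coefficients
n = nrow(x)
aic = dev + 2 * df
bic = dev + log(n) * df
lambda.aic = lasso.fit$lambda[which.min(aic)]
lambda.bic = lasso.fit$lambda[which.min(bic)]
coef(lasso.fit, s = lambda.aic)  # AIC-selected coefficients
coef(lasso.fit, s = lambda.bic)  # BIC-selected coefficients
```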

Choice of \(\lambda\) through CV

Use cv.glmnet() to get the two default choices lambda.min and lambda.1se for the lasso tuning parameter \(\lambda\) when the classification error is minimised. Calculate these two values on the log-scale as well to relate to the above lasso coefficient plot.

set.seed(1)
lasso.cv = cv.glmnet(x, y, family = "binomial", type.measure = "class")
c(lasso.cv$lambda.min,lasso.cv$lambda.1se)
round(log(c(lasso.cv$lambda.min,lasso.cv$lambda.1se)),2)
plot(lasso.cv)

Explore how sensitive the lambda.1se value is to different choices of \(k\) in the \(k\)-fold CV (changing the default option nfolds=10) as well as to different choices of the random seed. (For each lambda value, the SE is estimated from the prediction errors across the \(k\) CV folds. The 1SE lambda is the largest tuning parameter whose estimated mean prediction error lies within one SE of that of the lambda that minimises the prediction error.)
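One way to gauge this variability, here varying the seed with 5 folds (try other nfolds values too):

```r
# Refit the CV under several random fold allocations
for (seed in 1:5) {
  set.seed(seed)
  cv5 = cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 5)
  cat("seed", seed, ": log(lambda.1se) =", round(log(cv5$lambda.1se), 2), "\n")
}
```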

Marginality

In the lecture we briefly discussed the hierNet (pronounced “hair net”) package. Other packages, such as RAMP, implement similar algorithms that respect the marginality principle.

These packages tend not to be particularly user friendly. Here we will explore RAMP and hierNet.

Using the same example from the lecture, we can compare the RAMP and hierNet results.

data("wallabies", package = "mplot")
X = subset(wallabies, select = c(-rw,-lat,-long))
y = wallabies$rw
Xy = as.data.frame(cbind(X, y))
Xm = X[, 1:5]  # main effects only
Xm = scale(Xm, center = TRUE, scale = TRUE)
# hierNet (from lectures)
library(hierNet)
fit = hierNet.logistic.path(Xm, y, diagonal = FALSE,
                            strong = TRUE)
set.seed(9)
fitcv = hierNet.cv(fit, Xm, y, trace = 0)
fit.1se = hierNet.logistic(Xm, y, lam = fitcv$lamhat.1se,
                           diagonal = FALSE, strong = TRUE)
print(fit.1se)
# RAMP (equivalent RAMP code)
library(RAMP)
ramp = RAMP(Xm, y, family = "binomial", hier = "Strong")
ramp

Diabetes

Your turn: Returning to the diabetes example, use both the RAMP and hierNet packages to determine a good model from the set of variables that includes all pairwise interactions.

References