Learning outcomes

  1. Statistical models
    • Assumptions
    • Possible models, model space
    • What makes a good model
  2. “Classical” model selection methods
    • Residual Sum of Squares
    • \(R^2\) and adjusted \(R^2\)
    • Information criterion (e.g. AIC, BIC)
  3. Stepwise search schemes
    • Sequential testing
    • Backward stepwise methods

Required R packages:

install.packages(c("mplot","pairsD3","Hmisc","d3heatmap","mfp","dplyr"))

Diabetes example

Efron et al. (2004) introduced the diabetes data set, which has 442 observations and 11 variables. It is often used as an exemplar data set to illustrate new model selection techniques. The following commands will help you get a feel for the data.

data("diabetes", package="mplot")
# help("diabetes", package="mplot")
str(diabetes) # structure of the diabetes data
pairs(diabetes) # traditional pairs plot
boxplot(diabetes) # always a good idea to check for gross outliers
pairsD3::shinypairs(diabetes) # interactive pairs plot of the data set
d3heatmap::d3heatmap(cor(diabetes))
Hmisc::describe(diabetes,digits=1) # summary of the diabetes data

Stepwise model selection

Try to find a couple of good models using the techniques discussed in lectures.

Backwards

M0 = lm(y ~ 1, data = diabetes)  # Null model
M1 = lm(y ~ ., data = diabetes)  # Full model
summary(M1)
dim(diabetes)

Try doing backward selection using AIC first, then using BIC. Do you get the same result?
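As a sketch (assuming M0 and M1 are defined as above, and the mplot package is installed), step() performs backward selection with AIC by default; supplying k = log(n) swaps the penalty to BIC:

```r
# Backward stepwise selection on the diabetes data
data("diabetes", package = "mplot")
M1 = lm(y ~ ., data = diabetes)  # full model
n = nrow(diabetes)

step.aic = step(M1, direction = "backward", trace = FALSE)              # AIC: k = 2 (default)
step.bic = step(M1, direction = "backward", k = log(n), trace = FALSE)  # BIC: k = log(n)
```

Compare names(coef(step.aic)) with names(coef(step.bic)); BIC's heavier penalty typically retains fewer variables.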

Forwards

Explore forward selection, which works very similarly to backward selection: just set direction = "forward" in the step() function. We will discuss forward selection in more detail in lecture 2. When using direction = "forward" you need to specify a scope parameter: scope = list(lower = M0, upper = M1).
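A minimal sketch of forward selection with step(), assuming M0 and M1 as defined above:

```r
data("diabetes", package = "mplot")
M0 = lm(y ~ 1, data = diabetes)  # null model
M1 = lm(y ~ ., data = diabetes)  # full model

# Forward selection needs a scope: start from M0, consider terms up to M1
step.fwd = step(M0, scope = list(lower = M0, upper = M1),
                direction = "forward", trace = FALSE)
summary(step.fwd)
```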

The add1 command

Try using the add1() function. The general form is add1(fitted.model, test = "F", scope = M1).
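For example, starting from the null model, add1() reports an F-test for each candidate variable (again assuming M0 and M1 as defined above):

```r
data("diabetes", package = "mplot")
M0 = lm(y ~ 1, data = diabetes)  # null model
M1 = lm(y ~ ., data = diabetes)  # full model

add1(M0, test = "F", scope = M1)  # F-test for adding each variable to M0
```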

Body fat

Your turn to explore with the body fat data set. Using the techniques outlined in the first lecture identify one (or more) models that help explain the body fat percentage of an individual.

data("bodyfat", package = "mplot")
# help("bodyfat", package = "mplot")
dim(bodyfat)
names(bodyfat)
bfat = bodyfat[,-1] # remove the ID column

Both forward and backward stepwise select a model with Fore, Neck, Weight and Abdo. It appears to be substantially better than a simple linear regression of Bodyfat on Abdo (the best simple linear regression model).
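One way to arrive at that model, sketched with backward stepwise (object names follow the code above; the mplot package is assumed installed):

```r
data("bodyfat", package = "mplot")
bfat = bodyfat[, -1]  # remove the Id column

M1.bf = lm(Bodyfat ~ ., data = bfat)                          # full model
step.bf = step(M1.bf, direction = "backward", trace = FALSE)  # backward stepwise (AIC)
slr.bf = lm(Bodyfat ~ Abdo, data = bfat)                      # best simple linear regression

summary(step.bf)
```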

Advanced: Leave-one-out cross-validation

The leave-one-out cross-validation statistic (or prediction residual sum of squares - PRESS) is, \[CV = \frac{1}{n}\sum_{i=1}^{n}e_{[i]}^2,\] where \(e_{[i]} = y_i - \hat{y}_{[i]}\) for \(i=1,\ldots,n\) and \(\hat{y}_{[i]}\) is the predicted value obtained when the model is estimated after deleting the \(i\)th observation.

Program your own leave-one-out CV function and use it to evaluate between competing models from the body fat or diabetes examples. Hint: you might want to use the update() function: refit = update(fit, subset = index != i) where index and i are as defined in the outline of a function below.

loocv = function(fit){
  n = length(fit$residuals)
  yvar = fit$model[,1]
  index = 1:n
  e = rep(NA, n)
  for (i in index) {
    refit = 
    pred = 
    e[i] = 
  }
  return(mean(e^2))
}
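If you get stuck, here is one possible completion of the loop body, using update() as hinted (the function name loocv matches the skeleton above):

```r
loocv = function(fit) {
  n = length(fit$residuals)
  yvar = fit$model[, 1]  # response values used in the fit
  index = 1:n
  e = rep(NA, n)
  for (i in index) {
    refit = update(fit, subset = index != i)         # refit without observation i
    pred = predict(refit, newdata = fit$model[i, ])  # predict the held-out point
    e[i] = yvar[i] - pred                            # leave-one-out residual
  }
  return(mean(e^2))
}
```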

Diabetes

Compare the stepwise-selected model with a simpler model, simple.model = lm(y ~ bmi + ltg + map, data = diabetes). We’ll see justification for this simpler model in lab 4.

simple.model = lm(y~bmi+ltg+map, data = diabetes)
step.model = lm(y~bmi+ltg+map+tc+sex+ldl, data = diabetes)
loocv(step.model)
loocv(simple.model)
summary(step.model)$adj.r.square
summary(simple.model)$adj.r.square

Body fat

Compare the chosen stepwise model with the simple linear regression of Bodyfat on Abdo.

Optional extra: Confirm (numerically) that \[CV = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1-h_i}\right)^2,\] where \(e_i\) are the residuals from the model estimated with the full data set and \(h_i\) are the diagonal elements of the hat matrix for the full data set, \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\). The implication here is that you do not need to estimate all \(n\) submodels. Hint: lm.influence(fit)$h extracts the diagonal of the hat matrix and residuals(fit) extracts the residuals for the lm object fit.

loocv.h = function(fit) {
  h = 
  e = 
  return(            )
}
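One possible completion, using the hints above (lm.influence() for the leverages and residuals() for the full-data residuals):

```r
loocv.h = function(fit) {
  h = lm.influence(fit)$h  # diagonal elements of the hat matrix
  e = residuals(fit)       # residuals from the full-data fit
  return(mean((e / (1 - h))^2))
}
```

Because this uses only one fitted model, it is much faster than refitting all n submodels.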

Prediction error

The body fat data included in the mplot package is a subset of a larger data set, which can be found in the mfp package. You could consider the subset in the mplot package to be a training set.

data("bodyfat",package="mplot")
head(bodyfat)
nms = names(bodyfat)
training.id = bodyfat$Id
# install.packages("mfp")
data("bodyfat",package="mfp")
names(bodyfat)
# there were two additional measures of body fat, remove these:
bfat.full = subset(bodyfat, select = -c(brozek,density))
bfat.full[training.id[1:6],]
# ensure common naming conventions:
colnames(bfat.full) = nms
training = is.element(bfat.full$Id,training.id)
bfat.train = bfat.full[training,-1]
bfat.validate = bfat.full[!training,-1]
dim(bfat.train)
dim(bfat.validate)
boxplot(bfat.validate,horizontal = TRUE,las=1)
# re-estimate the models on the training data set
# (weight here is in lbs, not kg)
slr.bf = lm(Bodyfat ~ Abdo, data = bfat.train)
step.bf = lm(Bodyfat ~ Abdo + Fore + Neck + Weight, data = bfat.train)

Evaluate the performance of the various models identified earlier against the validation data (hint: use the predict() function).
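A small helper for the mean square prediction error makes the comparison convenient. Note mspe() is a name introduced here for illustration; the response argument names the outcome column in the held-out data:

```r
mspe = function(fit, newdata, response = "Bodyfat") {
  pred = predict(fit, newdata = newdata)  # predictions on the held-out data
  mean((newdata[[response]] - pred)^2)    # mean square prediction error
}
```

For example, compare mspe(slr.bf, bfat.validate) with mspe(step.bf, bfat.validate).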

Note that there are some unusual observations in the validation data set bfat.validate. You might consider identifying and removing those for the purposes of model selection.

The stepwise model still performs better than the simple linear regression, though the mean square prediction error of the simple linear regression is substantially reduced when the unusual observations are removed from the validation set.

References