Learning outcomes

  1. Cross validation with the caret package
  2. The mplot package
  3. Variable inclusion plots
  4. Model stability plots
  5. Bootstrapping the lasso

Required R packages:

install.packages(c("mplot","pairsD3","Hmisc","caret"))

Body fat data

In lab A we found that both forward and backward stepwise methods selected the same model, with Fore, Neck, Weight and Abdo. An F test “confirmed” that this was better than a simple linear regression of Bodyfat on Abdo; however, this was all done “in-sample”. It may be of interest to compare these two models on their out-of-sample predictive ability using cross validation with the caret package.

library(caret)
data("bodyfat", package = "mplot")
# help("bodyfat", package = "mplot")
names(bodyfat)
bfat = bodyfat[,-1] # remove the ID column
# specify how we want the cross validation to run
set.seed(2)
control = trainControl(method = "cv", number = 10)
lm4_cv = train(Bodyfat ~ Fore + Neck + Weight + Abdo, method = "lm",
               data = bfat, trControl = control)
lm4_cv
lm1_cv = train(Bodyfat ~ Abdo, method = "lm",
               data = bfat, trControl = control)
lm1_cv
results <- resamples(list(lm4 = lm4_cv, lm1 = lm1_cv))
summary(results)
library(ggplot2)
ggplot(results) + theme_classic() + labs(y = "MAE")
ggplot(results, metric = "RMSE") + 
  theme_classic() + 
  labs(y = "RMSE")
ggplot(results, metric = "Rsquared") + 
  theme_classic() + 
  labs(y = "R-squared")

Your turn: Repeat the above, but perform repeated 5-fold cross validation with 10 repeats. Hint: in the trainControl() function use the repeats argument.
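A minimal sketch of the modified control object (the rest of the workflow is unchanged; the object names control_rep, lm4_rcv and lm1_rcv are just illustrative):

# repeated 5-fold cross validation with 10 repeats
control_rep = trainControl(method = "repeatedcv", number = 5, repeats = 10)
lm4_rcv = train(Bodyfat ~ Fore + Neck + Weight + Abdo, method = "lm",
                data = bfat, trControl = control_rep)
lm1_rcv = train(Bodyfat ~ Abdo, method = "lm",
                data = bfat, trControl = control_rep)
summary(resamples(list(lm4 = lm4_rcv, lm1 = lm1_rcv)))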

For further information, see the documentation and vignettes of the caret package.

Diabetes data

Start exploring the diabetes example through the demonstrator version of the shiny interface that can be obtained through mplot(), either by visiting http://garthtarr.com/apps/mplot/ or by using the code below.

library(mplot)
load(url("http://garthtarr.com/dbeg2019.RData"))
# makes available the objects full.mod, af1, v1 and bgn1
mplot(full.mod, af1, v1, bgn1)
# mplot() takes a full fitted model as its first argument
# and then one or more objects of class vis, af or bglmnet

If the black sidebar is not visible on the left, click the menu/sidebar icon in the blue heading.

1. Explore visually the ‘Variable inclusion’ plot and do the following:

  1. Hover over the green colored hdl curve in the right ‘label panel’: the entire curve is rendered in bold.
    • Why does this hdl curve have a different behaviour (non-monotone decreasing) to the other curves (which are monotone decreasing)?
  2. Hover over the RV curve in the right panel: this curve shows how a variable that is known to be redundant behaves (alternative terms are ‘unimportant’ or ‘non-predictive’).
    • What do you conclude for curves (variables) that are close to this RV curve?
  3. Confirm that the bootstrapped/empirical probability of selecting tc is 0.87 (2dp) when an AIC penalty of \(\lambda=2\) is selected and that the bootstrapped probability of selection decreases to 0.59 (2dp) if a BIC penalty, i.e. \(\lambda=\log(n)\), is selected instead.
  4. Render a non-interactive version of the variable inclusion plot by clicking ‘No’ under the ‘Interactive plots’ heading in the left sidebar.

2. Explore the ‘Bootstrap glmnet’ tab next. (For the diabetes example we can easily produce an exhaustive variable inclusion plot, as seen above; in general, however, an exhaustive search may not be feasible, so bootstrapping glmnet is a useful alternative.)

  1. Confirm that hdl again has a different behaviour (as seen above) to the other curves (although ldl is also somewhat different: both hdl and ldl are cholesterol measurements, see https://en.wikipedia.org/wiki/Cholesterol).
  2. Confirm that the ordering of the variables according to their ‘importance’, estimated through the bootstrapped probability, is very similar in this plot to the ordering in the variable inclusion plot before.
  3. Confirm that four variables have a bootstrapped probability that is larger than 50% when the penalty parameter is 10. Confirm that this is the same number of variables in the variable inclusion plot that have a bootstrapped probability exceeding 50% when \(\lambda=10\).

Revisit Crime data from Lab A: explore subtractive measures

Read in the dataset UScrime from the MASS package and fit a full model (called M1). Note that you can access the t-statistic values by returning the third column of M1.sum$coefficients, where M1.sum = summary(M1).

data("UScrime", package = "MASS")
M1 = lm(y ~ .,data=UScrime) # Full model
M1.sum = summary(M1)
M1.sum$coefficients
M1.sum$coefficients[,3]

Note that the regression coefficient for Prob is very large, but so is its standard error; additionally, the standard errors vary considerably from variable to variable.

Copy and paste the sstab function from Lecture B, then modify it to generate \(B=200\) weighted bootstrap t-statistics, that is, standardize each estimated regression coefficient by dividing it by its standard error. Store this matrix in an object called sstab.out.

Calculate the absolute values of the bootstrapped t-statistics, delete the first column (the values for the intercept) and store the result in an object called sj.

Calculate the mean rank of the absolute t-values (copy and paste from the lecture) and note that Ineq and Ed have the highest rank.
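If you do not have the Lecture B code to hand, here is a minimal sketch of one way to obtain the same objects (the exact sstab function from the lecture may differ in its details):

# weighted bootstrap t-statistics for the full UScrime model
set.seed(2)
B = 200
n = nrow(UScrime)
p = length(coef(M1))
sstab.out = matrix(nrow = B, ncol = p)
for (b in 1:B) {
  w = rexp(n)                                 # exponential bootstrap weights
  M1.b = lm(y ~ ., data = UScrime, weights = w)
  sstab.out[b, ] = summary(M1.b)$coefficients[, 3]  # bootstrap t-statistics
}
colnames(sstab.out) = names(coef(M1))
sj = abs(sstab.out)[, -1]                     # drop the intercept column
# mean rank of the absolute t-values across the bootstrap samples
sort(colMeans(t(apply(sj, 1, rank))), decreasing = TRUE)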

Produce a variable inclusion plot and a bootstrapped loss versus dimension plot. Can you confirm the above observation?
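For example, using the vis() function from the mplot package (the object name v.crime and the choice of B are just illustrative):

library(mplot)
v.crime = vis(M1, B = 200)
plot(v.crime, which = "vip")   # variable inclusion plot
plot(v.crime, which = "boot")  # bootstrapped loss versus dimension plot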

Solid waste data of Gunst and Mason (1980)

This is a famous design matrix, often used to test variable selection methods. Here \(n=40\) and the number of slopes (explanatory variables) is \(k=4\), the total number of regression parameters is therefore \(p=5\). The design matrix is available in the bestglm package.

data("Shao", package = "bestglm")
X = as.matrix.data.frame(Shao)

Confirm that the number of rows and columns are 40 and 4, respectively, using the function dim (alternatively you could use the functions nrow or ncol).

Return the first few rows of the design matrix \(X\) using the command head and note that it does not yet have a column of ones representing the intercept.
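For instance:

dim(X)   # 40 rows, 4 columns
head(X)  # note: no column of ones for the intercept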

In the literature, the intercept is often chosen to be \(\beta_0 = 2\) and four data generating slope vectors are frequently used for this data: \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((0,0,4,8)\), \((9,0,4,8)\) or \((9,6,4,8)\).

We continue by looking at the smaller of the two intermediate cases, choosing \[\boldsymbol{\beta}^T=(2,0,0,4,8).\] The other three choices, as well as variations thereof, can be explored later in the lab.

The data generating model is \[\pmb{y} = \pmb{X}\pmb{\beta} + \pmb{\epsilon},\] with \(\pmb{\epsilon} \sim N(\pmb{0},\pmb{I})\).

To have reproducible results, set the random seed to 2 (it’s the second tutorial) and generate the response vector \(\pmb{y}\) by y = 2 + X %*% c(0,0,4,8) + rnorm(n). Note that %*% multiplies the matrix \(\pmb{X}\) with the (column) vector of slopes, and rnorm(n) generates \(n\) standard normal pseudo-random numbers.

set.seed(2)
n = 40
y = 2 + X %*% c(0,0,4,8) + rnorm(n)

Next, fit the full model using lm. Store the full model in an R object called lm.full and then produce its summary output.

lm.full = lm(y~X)
summary(lm.full)

Confirm that there are three highly significant terms, ‘unsurprisingly’ the (Intercept), Xx4 and Xx5.

Weighted bootstrap

Many fitting functions have a weights argument, which can be used to implement the weighted bootstrap. A simple way to achieve this is to repeatedly draw exponential bootstrap weights through w = rexp(n).

As an aside, the vector w has length n, with positive elements (the weights) whose sum is, on average, close to n.

First, observe what happens in a single run of the weighted bootstrap by reweighting the observations with w and storing the output of lm(y~X, weights=w) in the R object lm.full.b.

w = rexp(n)
length(w) # is of course n=40
sum(w) # is approximately 40
lm.full.b = lm(y~X, weights=w)
summary(lm.full.b)

Explore the weighted bootstrap next.

Observe how the coefficient values of the full model above change over, initially, \(B=10\) bootstrap samples (consider \(B=100\) and \(B=1000\) later, to see what changes and what more can be learned as \(B\) grows).

To achieve this, we repeat the above \(B\) times by writing a for loop and storing the coefficient vectors in the rows of a \(B\times p\) coefficient matrix C. (If you are a loop-novice and want to know more, you could work through https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r)

Store these coefficient values in a \(B\times p\) matrix C and return C by typing the following code:

B = 10
C = matrix(nrow=B,ncol=5)
for (b in 1:B){
  w = rexp(n) # is inside the loop, otherwise weights don't change
  lm.full.b = lm(y~X, weights=w)
  C[b,] = round(coef(lm.full.b),2)
}
colnames(C) = c("Intcpt","Slope_1","Slope_2","Slope_3","Slope_4")
C

Next, calculate the standard deviation (using sd) of each column of C (the bootstrapped regression coefficients) and draw a boxplot of the coefficient estimates.

To apply a function to every column of a matrix C, use the apply function. (If you are an apply-novice and want to know more, you could work through https://www.datacamp.com/community/tutorials/r-tutorial-apply-family)

apply(C,2,sd)
boxplot(C,horizontal=TRUE)

We observe that the first slope in particular varies a lot across the bootstrap samples, which is partly explained by the correlation structure in the data. Check the scatterplot matrix (e.g. with pairs) and the correlation matrix (e.g. with cor) to confirm this; as input to these two functions use the data frame dat = data.frame(y,X).
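For example:

dat = data.frame(y, X)
pairs(dat)          # scatterplot matrix
round(cor(dat), 2)  # correlation matrix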

Calculate the bootstrap means of the coefficients (the columns of C) and compare these values to the estimated coefficients of the full model above and to the true parameter vector. Note that the bootstrapped values are approximately unbiased for the estimated coefficients but not for the true coefficient values.
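One way to line the three up (the row labels are just illustrative):

round(rbind(bootstrap_mean = colMeans(C),
            full_model     = coef(lm.full),
            true_beta      = c(2, 0, 0, 4, 8)), 2)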

Repeat the above but change the number of bootstrap replications to \(B=100\) and then to \(B=1000\).

Further reading: The R packages boot and bootstrap both have functions to implement different versions of the regression bootstrap.

Three types of plots with vis()

Produce the three types of plots with the default values of the vis() function in the R package mplot. Store the output of vis in vis.out.

Check help(plot.vis) and note that the types of plots are which = c("vip", "lvk", "boot").
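A minimal sketch (vis() works most smoothly with a data-frame based fit, so the full model is refitted on dat = data.frame(y, X) first; the object names are just illustrative):

library(mplot)
dat = data.frame(y, X)
lm.full.d = lm(y ~ ., data = dat)
vis.out = vis(lm.full.d)       # default settings
plot(vis.out, which = "vip")   # variable inclusion plot
plot(vis.out, which = "lvk")   # loss versus dimension (k)
plot(vis.out, which = "boot")  # bootstrapped loss versus dimension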

By default, the vis() function adds an additional redundant variable to the set of candidate variables. Reproduce the plots, setting redundant = FALSE.

Note: if you’re using markdown and you want the mplot plots to appear, you need to specify results='asis' in the chunk options and add tag="chart" to the parameters of the plot() call.

Without the redundant variable, the plots can be reproduced in the same way.
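Continuing the sketch from above (the object name vis.out.nr is just illustrative):

vis.out.nr = vis(lm.full.d, redundant = FALSE)
plot(vis.out.nr, which = "vip")
plot(vis.out.nr, which = "lvk")
plot(vis.out.nr, which = "boot")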

Different data generating models

Explore what changes when the size of the data generating model changes, that is, choosing \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((9,0,4,8)\) or \((9,6,4,8)\).

Clearly, the coefficients are all fairly large compared to the error standard deviation, which is \(\sigma=1\). Explore what happens above when the slopes are rescaled, in particular for \(\boldsymbol{\beta}_{\text{slopes}}^T/2\) and \(\boldsymbol{\beta}_{\text{slopes}}^T/4\).
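For example, to regenerate the response with the slopes halved (the object name y2 is just illustrative), and then rerun the bootstrap and vis() steps as before:

set.seed(2)
y2 = 2 + X %*% (c(0, 0, 4, 8) / 2) + rnorm(n)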

Bootstrapping the lasso

Even though for this small data set exhaustive searches are the natural way to do model selection, we also explore the bglmnet() function. Among other things, it allows us to obtain a variable inclusion plot by learning only from the models on the bootstrapped lasso paths.

To begin with, use the first simulation setting with \(\pmb{\beta}^T = (2,0,0,4,8)\).
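A minimal sketch, reusing the data-frame based full model from above (the choice of B is illustrative and bglmnet() can take a little while to run):

bgn.out = bglmnet(lm.full.d, B = 100)
plot(bgn.out, which = "vip")  # variable inclusion plot from the bootstrapped lasso paths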