mplot
Install the mplot package and the other packages used in this lab:
install.packages(c("mplot","pairsD3","Hmisc"))
Start exploring the diabetes example through the demonstrator version of the shiny interface obtained through mplot(), either by visiting http://garthtarr.com/apps/mplot/ or by running the code below.
library(mplot)
load(url("http://garthtarr.com/dbeg.RData"))
# makes available objects full.mod, af1, v1 and bgn1
mplot(full.mod,af1,v1,bgn1)
This is a famous design matrix, often used to test variable selection methods. Here \(n=40\) and the number of slopes is \(k=4\), so the total number of regression parameters is \(p=5\). The design matrix is available in the bestglm package.
data("Shao", package = "bestglm")
X = as.matrix.data.frame(Shao)
Confirm that the number of rows and columns are 40 and 4, respectively. Return the first row of the design matrix \(X\) and note that it does not yet have a column of ones representing the intercept.
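For example:
dim(X)   # should return 40 4
X[1, ]   # first row: no column of ones yet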
In the literature, the intercept is often chosen to be \(\beta_0 = 2\) and four data generating slope vectors are frequently used for this data: \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((0,0,4,8)\), \((9,0,4,8)\) or \((9,6,4,8)\).
We continue by looking at the smaller of the two intermediate cases, choosing \(\boldsymbol{\beta}^T=(2,0,0,4,8)\). The other three choices, as well as variations thereof, can be explored later in the lab.
The data generating model is \(\pmb{y} = \pmb{X}\pmb{\beta} + \pmb{\epsilon}\), with \(\pmb{\epsilon} \sim N(\pmb{0},\pmb{I})\).
To obtain reproducible results, set the random seed to 4 (it’s the fourth tutorial), generate the response vector and fit the full model using lm().
set.seed(4)
n = 40
y = 2 + X %*% c(0,0,4,8) + rnorm(n)
lm.full = lm(y~X)
summary(lm.full)
Many fitting functions have a weights option, which can be used to implement the weighted bootstrap. A simple way to achieve this is to draw exponential bootstrap weights through w = rexp(n).
Explore the weighted bootstrap next.
Observe how the coefficient values of the full model above change over, initially, \(B=10\) bootstrap samples (consider \(B=100\) and \(B=1000\) later, to see what changes). Store these coefficient values in a \(B\times p\) matrix C, calculate the standard deviation of each coefficient and draw a boxplot of the coefficient estimates.
B = 10
C = matrix(nrow = B, ncol = 5)       # one row per bootstrap sample, one column per coefficient
for (b in 1:B){
  w = rexp(n)                        # exponential bootstrap weights
  lm.full.b = lm(y ~ X, weights = w) # weighted refit of the full model
  C[b,] = round(coef(lm.full.b), 2)
}
colnames(C) = c("Intcpt","Slope_1","Slope_2","Slope_3","Slope_4")
C
apply(C, 2, sd)                      # standard deviation of each coefficient
boxplot(C, horizontal = TRUE)
We observe that the first slope in particular varies a lot over the bootstrap samples, which is partly explained by the correlation structure in the data. Check the scatterplot and correlation matrix to confirm.
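For example, using base R:
pairs(X)            # scatterplot matrix of the predictors
round(cor(X), 2)    # correlation matrix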
Calculate the bootstrap means of the coefficients (columns of C) and compare these values to the estimated coefficients for the full model above and to the true parameter vector. Note that the bootstrapped values are approximately unbiased for the estimated coefficients but not for the true coefficient values.
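One way to make the comparison:
colMeans(C)         # bootstrap means of the coefficients
coef(lm.full)       # estimated coefficients from the full model
c(2, 0, 0, 4, 8)    # true parameter vector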
Repeat the above but change the number of bootstrap replications to \(B=100\) and then to \(B=1000\).
Further reading: the R packages boot and bootstrap both have functions to implement different versions of the regression bootstrap.
vis()
Produce the three types of plots with the default values of the vis() function in the R package mplot. Store the output of vis() in vis.out.
Check help(plot.vis) and note that the types of plots are which = c("vip", "lvk", "boot").
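A minimal sketch of one way to produce and store these plots (default settings assumed):
vis.out = vis(lm.full)
plot(vis.out, which = "vip")    # variable inclusion plot
plot(vis.out, which = "lvk")    # loss versus model size plot
plot(vis.out, which = "boot")   # bootstrap model stability plot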
By default, the vis() function adds a sixth additional redundant variable. Reproduce the plots, setting redundant = FALSE.
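For example, reusing the plotting calls above:
vis.out.nr = vis(lm.full, redundant = FALSE)
plot(vis.out.nr, which = "vip")
plot(vis.out.nr, which = "lvk")
plot(vis.out.nr, which = "boot")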
Note: if you’re using markdown and you want the mplot plots to appear, you need to specify results='asis' in the chunk options and add tag="chart" to the parameters of the plot() call.
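For instance (a sketch; the chunk header is shown here as a comment):
# in an R Markdown chunk declared with results='asis':
plot(vis.out, which = "vip", tag = "chart")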
af()
Explore the simplified adaptive fence: look at the summary of the output of the af() function and at its googleVis plot.
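A minimal sketch, using the default settings of af():
af.out = af(lm.full)
summary(af.out)   # summary of the adaptive fence output
plot(af.out)      # googleVis plot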
Finally, get a shiny interface through mplot(lm.full, vis.out, af.out).
Explore what changes when the size of the data generating model changes, that is, choosing \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((9,0,4,8)\) or \((9,6,4,8)\).
Clearly, the coefficients are all fairly large compared to the error standard deviation, which is \(\sigma=1\). Explore what happens above when choosing rescaled slopes, in particular for \(\boldsymbol{\beta}_{\text{slopes}}^T/2\) and \(\boldsymbol{\beta}_{\text{slopes}}^T/4\).
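One way to rerun the analysis under a different or rescaled slope vector (a sketch; the particular choice below is just one of the suggested options):
set.seed(4)
beta.slopes = c(9, 0, 4, 8) / 2        # e.g. a rescaled slope vector
y2 = 2 + X %*% beta.slopes + rnorm(n)
lm.full2 = lm(y2 ~ X)
# then repeat the vis(), af() and mplot() steps above with lm.full2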
Even though exhaustive searches are the natural way to do model selection for this small data set, we also explore the bglmnet() function. Among other things, we can obtain the variable inclusion plot by learning only from the models on the bootstrapped lasso paths. To begin with, use the first simulation setting with \(\pmb{\beta}^T = (2,0,0,4,8)\).
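A minimal sketch (the which = "vip" argument to the plot method is an assumption; see help(plot.bglmnet) for the available options):
bg.out = bglmnet(lm.full)
plot(bg.out, which = "vip")   # variable inclusion plot from the bootstrapped lasso paths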
Revisit any of the data sets in previous labs.