mplot
Install the mplot package and the other packages used in this lab:
install.packages(c("mplot","pairsD3","Hmisc"))
Start exploring the diabetes example through the demonstrator version of the shiny interface obtained through mplot(), either by visiting http://garthtarr.com/apps/mplot/ or by running the code below.
library(mplot)
load(url("http://garthtarr.com/dbeg.RData"))
# makes available objects full.mod, af1, v1 and bgn1
mplot(full.mod,af1,v1,bgn1)
This is a famous design matrix, often used to test variable selection methods. Here \(n=40\) and the number of slopes is \(k=4\), so the total number of regression parameters is \(p=5\). The design matrix is available in the bestglm package.
data("Shao", package = "bestglm")
X = as.matrix.data.frame(Shao)
Confirm that the number of rows and columns are 40 and 4, respectively. Return the first row of the design matrix \(X\) and note that it does not yet have a column of ones representing the intercept.
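For example:
dim(X)   # should return 40 4
X[1, ]   # first row: no column of ones yet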
In the literature, the intercept is often chosen to be \(\beta_0 = 2\) and four data generating slope vectors are frequently used for this data: \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((0,0,4,8)\), \((9,0,4,8)\) or \((9,6,4,8)\).
We continue by looking at the smaller of the two intermediate cases, choosing \(\boldsymbol{\beta}^T=(2,0,0,4,8)\). The other three choices, as well as variations thereof, can be explored later in the lab.
The data generating model is \(\pmb{y} = \pmb{X}\pmb{\beta} + \pmb{\epsilon}\), with \(\pmb{\epsilon} \sim N(\pmb{0},\pmb{I})\).
To obtain reproducible results, set the random seed to 4 (it’s the fourth tutorial), generate the response vector and fit the full model using lm().
set.seed(4)
n = 40
y = 2 + X %*% c(0,0,4,8) + rnorm(n)
lm.full = lm(y~X)
summary(lm.full)
Many fitting functions have a weights option, which can be used to implement the weighted bootstrap. A simple way to achieve this is to draw exponential bootstrap weights through w = rexp(n).
Explore the weighted bootstrap next.
Observe how the coefficient values of the full model above change over, initially, \(B=10\) bootstrap samples (consider \(B=100\) and \(B=1000\) later, to see what changes). Store these coefficient values in a \(B\times p\) matrix C, calculate the standard deviation of each coefficient and draw a boxplot of the coefficient estimates.
B = 10
C = matrix(nrow = B, ncol = 5)       # one row per bootstrap sample, one column per coefficient
for (b in 1:B){
  w = rexp(n)                        # exponential bootstrap weights
  lm.full.b = lm(y ~ X, weights = w) # weighted refit of the full model
  C[b,] = round(coef(lm.full.b), 2)
}
colnames(C) = c("Intcpt","Slope_1","Slope_2","Slope_3","Slope_4")
C
apply(C, 2, sd)                      # standard deviation of each coefficient
boxplot(C, horizontal = TRUE)
We observe that the first slope in particular varies a lot over the bootstrap samples, which is partly explained by the correlation structure in the data. Check the scatterplot and correlation matrix to confirm.
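For example, using base R:
pairs(X)            # scatterplot matrix of the predictors
round(cor(X), 2)    # correlation matrix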
Calculate the bootstrap means of the coefficients (columns of C) and compare these values to the estimated coefficients for the full model above and to the true parameter vector. Note that the bootstrapped values are approximately unbiased for the estimated coefficients but not for the true coefficient values.
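One way to make the comparison:
colMeans(C)         # bootstrap means of the coefficients
coef(lm.full)       # estimated coefficients from the full model
c(2, 0, 0, 4, 8)    # true parameter vector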
Repeat the above but change the number of bootstrap replications to \(B=100\) and then to \(B=1000\).
Further reading: the R packages boot and bootstrap both have functions to implement different versions of the regression bootstrap.
vis()
Produce the three types of plots with the default values of the vis() function in the R package mplot. Store the output of vis() in vis.out.
Check help(plot.vis) and note that the types of plots are which = c("vip", "lvk", "boot").
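A minimal sketch of one way to produce and store these plots (default settings assumed):
vis.out = vis(lm.full)
plot(vis.out, which = "vip")    # variable inclusion plot
plot(vis.out, which = "lvk")    # loss versus model size plot
plot(vis.out, which = "boot")   # bootstrap model stability plot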
By default, the vis() function adds a sixth additional redundant variable. Reproduce the plots, setting redundant = FALSE.
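For example, reusing the plotting calls above:
vis.out.nr = vis(lm.full, redundant = FALSE)
plot(vis.out.nr, which = "vip")
plot(vis.out.nr, which = "lvk")
plot(vis.out.nr, which = "boot")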
Note: if you’re using markdown and you want the mplot plots to appear, you need to specify results='asis' in the chunk options and add tag="chart" to the parameters of the plot() call.
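For instance (a sketch; the chunk header is shown here as a comment):
# in an R Markdown chunk declared with results='asis':
plot(vis.out, which = "vip", tag = "chart")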
af()
Explore the simplified adaptive fence: look at the summary of the output of the af() function and at its googleVis plot.
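A minimal sketch, using the default settings of af():
af.out = af(lm.full)
summary(af.out)   # summary of the adaptive fence output
plot(af.out)      # googleVis plot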
Finally, get a shiny interface through mplot(lm.full, vis.out, af.out).
Explore what changes when the size of the data generating model changes, that is, choosing \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((9,0,4,8)\) or \((9,6,4,8)\).
Clearly, the coefficients are all fairly large compared to the error standard deviation, which is \(\sigma=1\). Explore what happens above when choosing rescaled slopes, in particular for \(\boldsymbol{\beta}_{\text{slopes}}^T/2\) and \(\boldsymbol{\beta}_{\text{slopes}}^T/4\).
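One way to rerun the analysis under a different or rescaled slope vector (a sketch; the particular choice below is just one of the suggested options):
set.seed(4)
beta.slopes = c(9, 0, 4, 8) / 2        # e.g. a rescaled slope vector
y2 = 2 + X %*% beta.slopes + rnorm(n)
lm.full2 = lm(y2 ~ X)
# then repeat the vis(), af() and mplot() steps above with lm.full2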
Even though exhaustive searches are the natural way to do model selection for this small data set, we also explore the bglmnet() function. Among other things, we can obtain the variable inclusion plot by learning only from the models on the bootstrapped lasso paths. To begin with, use the first simulation setting with \(\pmb{\beta}^T = (2,0,0,4,8)\).
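A minimal sketch (the which = "vip" argument to the plot method is an assumption; see help(plot.bglmnet) for the available options):
bg.out = bglmnet(lm.full)
plot(bg.out, which = "vip")   # variable inclusion plot from the bootstrapped lasso paths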
Revisit any of the data sets in previous labs.