This lab uses the caret and mplot packages (among others). If they are not already installed, run:
install.packages(c("mplot","pairsD3","Hmisc","caret"))
In Lab A we found that both forward and backward stepwise methods selected the same model: Fore, Neck, Weight and Abdo. An F test “confirmed” that this was better than a simple linear regression of Bodyfat on Abdo; however, this was all done “in-sample”. It may be of interest to compare these two models on their out-of-sample predictive ability using cross validation with the caret package.
library(caret)
data("bodyfat", package = "mplot")
# help("bodyfat", package = "mplot")
names(bodyfat)
bfat = bodyfat[,-1] # remove the ID column
# specify how we want the cross validation to run
set.seed(2)
control = trainControl(method = "cv", number = 10)
lm4_cv = train(Bodyfat ~ Fore + Neck + Weight + Abdo, method = "lm",
data = bfat, trControl = control)
lm4_cv
lm1_cv = train(Bodyfat ~ Abdo, method = "lm",
data = bfat, trControl = control)
lm1_cv
results <- resamples(list(lm4 = lm4_cv, lm1 = lm1_cv))
summary(results)
library(ggplot2)
ggplot(results) + theme_classic() + labs(y = "MAE")
ggplot(results, metric = "RMSE") +
theme_classic() +
labs(y = "RMSE")
ggplot(results, metric = "Rsquared") +
theme_classic() +
labs(y = "R-squared")
Your turn: Repeat the above, but perform repeated 5-fold cross validation with 10 repeats. Hint: in the trainControl() function use the repeats argument.
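One way to set this up (a sketch only; the object names control_rep and lm4_rcv are illustrative):
set.seed(2)
# repeated 5-fold cross validation with 10 repeats
control_rep = trainControl(method = "repeatedcv", number = 5, repeats = 10)
lm4_rcv = train(Bodyfat ~ Fore + Neck + Weight + Abdo, method = "lm",
                data = bfat, trControl = control_rep)
lm4_rcv
The one-variable model can be refitted in the same way and the two compared with resamples() as before.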
For further information, see the caret package documentation.
Start exploring the diabetes example through the demonstrator version of the shiny interface that can be obtained through mplot(), either by visiting http://garthtarr.com/apps/mplot/ or by using the code below.
library(mplot)
load(url("http://garthtarr.com/dbeg2019.RData"))
# makes available the objects full.mod, af1, v1 and bgn1
mplot(full.mod,af1,v1,bgn1)
# mplot takes as an input a full model in its first slot
# and then one or more objects of type vis, af or bglmnet
If the black sidebar is not visible on the left, click the menu/sidebar icon in the blue heading.
1. Explore the ‘Variable inclusion’ plot visually and do the following:
2. Next, explore the ‘Bootstrap glmnet’ tab (for the diabetes example we can easily produce an exhaustive variable inclusion plot, as seen above; in general, however, this may not be feasible, so bootstrapping glmnet is a more practical alternative):
Read in the dataset UScrime from the MASS package and fit a full model (called M1). Note that you can access the t-statistic values by returning the third column of M1.sum$coefficients, where M1.sum = summary(M1).
data("UScrime", package = "MASS")
M1 = lm(y ~ .,data=UScrime) # Full model
M1.sum = summary(M1)
M1.sum$coefficients
M1.sum$coefficients[,3]
Note that the regression coefficient for Prob is very large, but so is its corresponding standard error; additionally, the standard errors vary considerably from variable to variable.
Copy and paste the sstab function from Lecture B, then modify it to generate \(B=200\) weighted bootstrap t statistics, that is, standardize the estimated regression coefficients by dividing them by their standard errors. Store this matrix in an object called sstab.out (a sketch covering this and the following steps is given below).
Calculate the absolute values of the bootstrapped t-statistics, delete the first column (the values for the intercept) and store the result in an object called sj.
Calculate the mean rank of the absolute t-values (copy and paste from the lecture) and note that Ineq and Ed have the highest ranks.
Produce a variable inclusion plot and a bootstrapped loss versus dimension plot. Can you confirm the above observation?
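A possible sketch of these steps is given below. The sstab function from Lecture B is not reproduced here; the loop is an illustrative stand-in for the modified version, and the object names and the B argument of vis() are assumptions to check against help(vis).
library(mplot)
set.seed(2)
B = 200
n = nrow(UScrime)
sstab.out = matrix(nrow = B, ncol = length(coef(M1)))
colnames(sstab.out) = names(coef(M1))
for (b in 1:B) {
  w = rexp(n)                                       # exponential bootstrap weights
  M1.b = lm(y ~ ., data = UScrime, weights = w)     # weighted bootstrap refit
  sstab.out[b, ] = summary(M1.b)$coefficients[, 3]  # t-statistics = estimate / std. error
}
sj = abs(sstab.out)[, -1]                           # absolute t-values, intercept dropped
mean.rank = colMeans(t(apply(sj, 1, rank)))         # mean rank of the absolute t-values
sort(mean.rank, decreasing = TRUE)                  # Ineq and Ed should come out on top
v.crime = vis(M1, B = 200)
plot(v.crime, which = "vip")                        # variable inclusion plot
plot(v.crime, which = "boot")                       # bootstrapped loss versus dimension plot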
This is a famous design matrix, often used to test variable selection methods. Here \(n=40\) and the number of slopes (explanatory variables) is \(k=4\); the total number of regression parameters is therefore \(p=5\). The design matrix is available in the bestglm package.
data("Shao", package = "bestglm")
X = as.matrix.data.frame(Shao)
Confirm that the number of rows and columns are 40 and 4, respectively, using the function dim (alternatively you could use the functions ncol or nrow).
Return the first few rows of the design matrix \(X\) using the command head, and note that it does not yet have a column of ones representing the intercept.
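For example, using standard base R calls:
dim(X)    # should be 40 rows and 4 columns
nrow(X)
ncol(X)
head(X)   # first few rows; no column of ones for the intercept yet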
In the literature, the intercept is often chosen to be \(\beta_0 = 2\) and four data generating slope vectors are frequently used for this data: \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((0,0,4,8)\), \((9,0,4,8)\) or \((9,6,4,8)\).
We continue with looking at the smaller of the two intermediate cases, choosing \[\boldsymbol{\beta}^T=(2,0,0,4,8).\] The other three choices as well as variations thereof can be done later in the lab.
The data generating model is \[\pmb{y} = \pmb{X}\pmb{\beta} + \pmb{\epsilon},\] with \(\pmb{\epsilon} \sim N(\pmb{0},\pmb{I})\).
To have reproducible results, set the random seed to 2 (it’s the second tutorial) and generate the response vector \(\pmb{y}\) by y = 2 + X %*% c(0,0,4,8) + rnorm(n). Note that %*% multiplies the matrix \(\pmb{X}\) with the (column) vector \(\pmb{\beta}\), and rnorm(n) generates \(n\) standard normally distributed pseudo-random numbers.
set.seed(2)
n = 40
y = 2 + X %*% c(0,0,4,8) + rnorm(n)
Next, fit the full model using lm. Store the full model in an R object called lm.full and then produce its summary output.
lm.full = lm(y~X)
summary(lm.full)
Confirm that there are three highly significant variables, ‘unsurprisingly’ the (Intercept), Xx4 and Xx5.
Many fitting functions have a weights option, which can be used to implement the weighted bootstrap. A simple way to achieve this is by repeatedly drawing exponential bootstrap weights through w = rexp(n).
As an aside, the vector w has length n, with non-negative elements (the weights) that sum to a number close to n.
First, observe what happens for a single run of the weighted bootstrap, by reweighting observations with w and storing the output of lm(y~X, weights=w) in the R object lm.full.b.
w = rexp(n)
length(w) # is of course n=40
sum(w) # is approximately 40
lm.full.b = lm(y~X, weights=w)
summary(lm.full.b)
Explore the weighted bootstrap next.
Observe how the coefficient values for the full model above change over, initially, \(B=10\) bootstrap samples (consider \(B=100\) and \(B=1000\) later, to see what changes and what more can be learned as \(B\) grows).
To achieve this, we repeat the above \(B\) times by writing a for loop and storing the coefficient vectors in the rows of a \(B\times p\) coefficient matrix C. (If you are a loop novice and want to know more, you could work through https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r)
Store these coefficient values in a \(B\times p\) matrix C and return C by typing the following code:
B = 10
C = matrix(nrow=B,ncol=5)
for (b in 1:B){
w = rexp(n) # is inside the loop, otherwise weights don't change
lm.full.b = lm(y~X, weights=w)
C[b,] = round(coef(lm.full.b),2)
}
colnames(C) = c("Intcpt","Slope_1","Slope_2","Slope_3","Slope_4")
C
Next, calculate for each column (showing the bootstrapped regression coefficients) the standard deviation (using sd) and draw a boxplot of the coefficient estimates.
To apply a function to every column of a matrix C, use the apply function. (If you are an apply novice and want to know more, you could work through https://www.datacamp.com/community/tutorials/r-tutorial-apply-family)
apply(C,2,sd)
boxplot(C,horizontal=TRUE)
We observe that the first slope in particular varies a lot over the bootstrap samples, partly explained by the correlation structure in the data. Check the scatterplot (e.g. with pairs) and correlation matrix (e.g. with cor) to confirm; as an input to these two functions use the data frame dat = data.frame(y,X).
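For example (a sketch; dat is the data frame suggested in the text):
dat = data.frame(y, X)    # response and the four explanatory variables
pairs(dat)                # scatterplot matrix
round(cor(dat), 2)        # correlation matrix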
Calculate the bootstrap means of the coefficients (columns of C) and compare these values to the estimated coefficients for the full model above and the true parameter vector. Note that the bootstrapped values are approximately unbiased for the estimated coefficients but not for the true coefficient values.
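One way to do this comparison (a sketch):
colMeans(C)                  # bootstrap means of the coefficients
round(coef(lm.full), 2)      # estimated coefficients of the full model
c(2, 0, 0, 4, 8)             # true parameter vector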
Repeat the above but change the number of bootstrap replications to \(B=100\) and then to \(B=1000\).
Further reading: The R packages boot and bootstrap both have functions to implement different versions of the regression bootstrap.
vis()
Produce the three types of plots with the default values of the vis() function in the R package mplot. Store the output of vis in vis.out.
Check help(plot.vis) and note that the types of plots are which = c("vip", "lvk", "boot").
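A minimal sketch, assuming the full model is refitted on the data frame dat from above (the name lm.full.d is illustrative; refitting on a data frame lets vis() work with the individual explanatory variables):
lm.full.d = lm(y ~ ., data = dat)   # refit on the data frame rather than the matrix
vis.out = vis(lm.full.d)            # default settings
plot(vis.out, which = "vip")        # variable inclusion plot
plot(vis.out, which = "lvk")        # loss versus dimension plot
plot(vis.out, which = "boot")       # bootstrapped loss versus dimension plot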
By default, the vis() function adds a sixth additional redundant variable. Reproduce the plots, setting redundant = FALSE.
Note: if you’re using markdown and you want the mplot plots to appear, you need to specify results='asis' in the chunk options and add tag="chart" to the parameters of the plot() call.
Without the redundant variable, the plots can be reproduced as sketched below.
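A minimal sketch under the same assumptions as above (add tag = "chart" to the plot() calls if you are knitting with results='asis', as noted above):
vis.out2 = vis(lm.full.d, redundant = FALSE)
plot(vis.out2, which = "vip")
plot(vis.out2, which = "lvk")
plot(vis.out2, which = "boot")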
Explore what changes when the size of the data generating model changes, that is, choosing \(\boldsymbol{\beta}_{\text{slopes}}^T = (0,0,4,0)\), \((9,0,4,8)\) or \((9,6,4,8)\).
Clearly, the coefficients are all fairly large compared to the error standard deviation, which is \(\sigma=1\). Explore what happens above when choosing rescaled slopes, in particular for \(\boldsymbol{\beta}_{\text{slopes}}^T/2\) and \(\boldsymbol{\beta}_{\text{slopes}}^T/4\).
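For the halved slopes, one possible sketch (the object names y.half, dat.half and vis.half are illustrative; the quartered slopes work analogously):
set.seed(2)
y.half = 2 + X %*% (c(0, 0, 4, 8) / 2) + rnorm(n)   # rescaled slopes, beta_slopes / 2
dat.half = data.frame(y = y.half, X)
vis.half = vis(lm(y ~ ., data = dat.half))
plot(vis.half, which = "vip")
plot(vis.half, which = "boot")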
Even though for this small data set exhaustive searches are the natural way to do model selection, we explore the bglmnet() function. Among other things, we can obtain the variable inclusion plot by learning only from the models on the bootstrapped lasso paths.
To begin with, use the first simulation setting with \(\pmb{\beta}^T = (2,0,0,4,8)\).
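A possible starting point (a sketch; it assumes the refitted full model lm.full.d from above, and B = 100 is an illustrative choice — check help(bglmnet) and help(plot.bglmnet) for the exact arguments):
bgn.out = bglmnet(lm.full.d, B = 100)
plot(bgn.out, which = "vip")   # variable inclusion plot built from the bootstrapped lasso paths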