install.packages(c("mplot","leaps","bestglm","car"))
data("UScrime", package = "MASS")
M0 = lm(y ~ 1,data=UScrime) # Null model
M1 = lm(y ~ .,data=UScrime) # Full model
Use the regsubsets() function from the leaps package to obtain the 50 best models for each model size and plot their RSS against model dimension.
Use the method = "forward" and method = "backward" and method = "exhaustive" options to perform forward, backward and exhaustive model selection and compare the results (only find a single best model at each dimension). Hint: The regsubsets function returns several information criteria, choose for example Mallow’s Cp.
Use the plot method associated with regsubsets objects to visualise the BIC for the various models identified.
The car package offers alternative plot methods to visualise the results from a call to regsubsets(). The code below plots \(C_p\) against dimension for a regsubsets object called exh. Try plotting adjusted \(R^2\) against dimension.
library(car)
par(mar = c(4.5,4.5,1,1),mfrow=c(1,1))
subsets(exh, statistic="cp", legend = FALSE, max.size=8)
abline(a = 0, b = 1, lty = 2)
The second example is the birthwt dataset from the MASS package which has data on 189 births at the Baystate Medical Centre, Springfield, Massachusetts during 1986 (Venables and Ripley, 2002) The main variable of interest is low birth weight, a binary response variable low (Hosmer and Lemeshow, 1989). Take the same approach to modelling the full model as in Venables and Ripley (2002), where ptl is reduced to a binary indicator of past history and ftv is reduced to a factor with three levels:
library(MASS)
bwt <- with(birthwt, {
race <- factor(race, labels = c("white", "black", "other"))
ptd <- factor(ptl > 0)
ftv <- factor(ftv)
levels(ftv)[-(1:2)] <- "2+"
data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})
options(contrasts = c("contr.treatment", "contr.poly"))
bw.glm1 <- glm(low ~ ., family = binomial, data = bwt)
bw.glm0 <- glm(low ~ 1, family = binomial, data = bwt)
summary(bw.glm1)
Use forward and backward stepwise to identify potential models. Do you get the same model using both approaches?
Use the bestglm() function from the bestglm package to identify the top 10 models according to the AIC.
Use the cross-validation argument in bestglm to identify the best rock-wallaby model using cross-validation.
data("wallabies", package = "mplot")
names(wallabies)
wdat = data.frame(subset(wallabies, select = -c(lat, long)),
EaD = wallabies$edible * wallabies$distance,
EaS = wallabies$edible * wallabies$shelter,
DaS = wallabies$distance * wallabies$shelter)
X = subset(wdat, select = -rw)
y = wdat$rw
Xy = as.data.frame(cbind(X, y))
The basic idea behind stagewise modelling is:
i in large_number:
Implement the stagewise procedure for the artificial example introduced in the lecture. Hint: the core of the function will look like this:
data("artificialeg",package="mplot")
y = artificialeg$y-mean(artificialeg$y)
X = artificialeg[,-10]
y = # mean zero predictors
M = # standardised predictor matrix
lots = 20000
beta = matrix(0, ncol=ncol(M), nrow=lots)
r = y
eps = 0.001
for (i in 2:lots){
co = cor(M,r)
j =
delta =
b = beta[i-1,]
b[j] =
beta[i,] = b
r =
}
matplot().