install.packages(c("mplot","leaps","bestglm","car"))
data("UScrime", package = "MASS")
M0 = lm(y ~ 1,data=UScrime) # Null model
M1 = lm(y ~ .,data=UScrime) # Full model
Use the regsubsets() function from the leaps package to obtain the 50 best models for each model size and plot their RSS against model dimension.
Use the method = "forward", method = "backward" and method = "exhaustive" options to perform forward, backward and exhaustive model selection and compare the results (only find a single best model at each dimension). Hint: the regsubsets function returns several information criteria; choose, for example, Mallows' Cp.
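One way to set this up (a sketch; nvmax = 15 allows models up to the full size):

```r
library(leaps)
exh = regsubsets(y ~ ., data = UScrime, method = "exhaustive", nvmax = 15)
fwd = regsubsets(y ~ ., data = UScrime, method = "forward", nvmax = 15)
bwd = regsubsets(y ~ ., data = UScrime, method = "backward", nvmax = 15)
# compare, e.g., Mallows' Cp for the best model of each dimension
cbind(exh = summary(exh)$cp, fwd = summary(fwd)$cp, bwd = summary(bwd)$cp)
```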
Use the plot method associated with regsubsets objects to visualise the BIC for the various models identified.
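For example, if the exhaustive search object is called exh:

```r
# each row is a model; rows are ordered by BIC, with the best model at the top
plot(exh, scale = "bic")
```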
The car package offers alternative plot methods to visualise the results from a call to regsubsets(). The code below plots \(C_p\) against dimension for a regsubsets object called exh. Try plotting adjusted \(R^2\) against dimension.
library(car)
par(mar = c(4.5,4.5,1,1),mfrow=c(1,1))
subsets(exh, statistic="cp", legend = FALSE, max.size=8)
abline(a = 0, b = 1, lty = 2)
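A possible answer, swapping the statistic argument (note that the reference line is specific to \(C_p\), so it is not needed here):

```r
subsets(exh, statistic = "adjr2", legend = FALSE, max.size = 8)
```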
The second example is the birthwt dataset from the MASS package, which has data on 189 births at the Baystate Medical Centre, Springfield, Massachusetts during 1986 (Venables and Ripley, 2002). The main variable of interest is low birth weight, a binary response variable low (Hosmer and Lemeshow, 1989). Take the same approach to modelling the full model as in Venables and Ripley (2002), where ptl is reduced to a binary indicator of past history and ftv is reduced to a factor with three levels:
library(MASS)
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})
options(contrasts = c("contr.treatment", "contr.poly"))
bw.glm1 <- glm(low ~ ., family = binomial, data = bwt)
bw.glm0 <- glm(low ~ 1, family = binomial, data = bwt)
summary(bw.glm1)
Use forward and backward stepwise to identify potential models. Do you get the same model using both approaches?
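One way to run both searches with step() (a sketch; trace = 0 suppresses the step-by-step output):

```r
# forward: start from the null model, search up to the full model
step.fwd = step(bw.glm0, scope = list(lower = bw.glm0, upper = bw.glm1),
                direction = "forward", trace = 0)
# backward: start from the full model
step.bwd = step(bw.glm1, direction = "backward", trace = 0)
# compare the selected models
formula(step.fwd)
formula(step.bwd)
```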
Use the bestglm() function from the bestglm package to identify the top 10 models according to the AIC.
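bestglm() expects a data frame with the predictors first and the response as the last column, named y. A sketch for the birth weight data (the reordering of bwt into bwt.Xy is an assumption about the required layout):

```r
library(bestglm)
bwt.Xy = data.frame(bwt[, -1], y = bwt$low)  # response last, named y
bw.aic = bestglm(bwt.Xy, family = binomial, IC = "AIC", TopModels = 10)
bw.aic$BestModels  # the top 10 models by AIC
```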
Use the cross-validation argument in bestglm to identify the best rock-wallaby model using cross-validation.
data("wallabies", package = "mplot")
names(wallabies)
wdat = data.frame(subset(wallabies, select = -c(lat, long)),
EaD = wallabies$edible * wallabies$distance,
EaS = wallabies$edible * wallabies$shelter,
DaS = wallabies$distance * wallabies$shelter)
X = subset(wdat, select = -rw)
y = wdat$rw
Xy = as.data.frame(cbind(X, y))
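Cross-validation is selected through the IC and CVArgs arguments; a sketch using 10-fold cross-validation (the particular CVArgs settings are an assumption, and family = binomial is used because rw is a binary presence/absence variable):

```r
library(bestglm)
bw.cv = bestglm(Xy, family = binomial, IC = "CV",
                CVArgs = list(Method = "HTF", K = 10, REP = 1))
bw.cv$BestModel
```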
The basic idea behind stagewise modelling is:

1. Start with all coefficients equal to zero and set the residual equal to y.
2. For i in 1, ..., large_number:
   - find the predictor most correlated with the current residual;
   - update that predictor's coefficient by a small amount eps in the direction of the correlation; and
   - recompute the residual.
Implement the stagewise procedure for the artificial example introduced in the lecture. Hint: the core of the function will look like this:
data("artificialeg", package = "mplot")
y = artificialeg$y - mean(artificialeg$y) # mean zero response
X = artificialeg[, -10]
M = # standardised predictor matrix
lots = 20000
beta = matrix(0, ncol = ncol(M), nrow = lots)
r = y
eps = 0.001
for (i in 2:lots){
  co = cor(M, r)
  j = # index of the predictor most correlated with the residual
  delta = # step of size eps in the direction of the correlation
  b = beta[i-1, ]
  b[j] = # updated jth coefficient
  beta[i, ] = b
  r = # updated residual
}
Use matplot() to visualise the coefficient paths stored in beta.
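For reference, one possible completion of the blanks above (an assumption about the intended solution; check it against your own derivation):

```r
data("artificialeg", package = "mplot")
y = artificialeg$y - mean(artificialeg$y)  # mean zero response
M = scale(artificialeg[, -10])             # standardised predictor matrix
lots = 20000
eps = 0.001
beta = matrix(0, ncol = ncol(M), nrow = lots)
r = y
for (i in 2:lots) {
  co = cor(M, r)             # correlation of each predictor with the residual
  j = which.max(abs(co))     # most correlated predictor
  delta = eps * sign(co[j])  # small step in the direction of the correlation
  b = beta[i - 1, ]
  b[j] = b[j] + delta
  beta[i, ] = b
  r = r - delta * M[, j]     # remove the fitted step from the residual
}
matplot(beta, type = "l", lty = 1,
        xlab = "Iteration", ylab = "Coefficient estimate")
```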