Background

Statistical model building is a fundamental part of many statistical analyses. The aim is to use the data and, if available, knowledge about the data-generating process to construct statistical models that parsimoniously describe the relevant and important features of the data. Arguably the most widely used approach to variable selection is to minimise either Akaike’s information criterion (AIC) or the Bayesian information criterion (BIC), or one of their variants. However, AIC and BIC are not the only criteria of interest for selecting good models. A major advance in the field is the Lasso, together with related methods based on regularisation. The Lasso and its extensions can handle data with more predictor variables than samples, the ‘large p, small n’ problem. The stability of the selected components is paramount for a reliable final predictive model; it can be assessed through stability paths, obtained by repeating the model selection on bootstrapped or cross-validated samples.
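For concreteness, a minimal R sketch of both approaches on simulated data follows; the data, variable names and simulation settings are illustrative only, not course material. Stepwise search with step() minimises AIC (k = 2) or BIC (k = log(n)), while glmnet fits the Lasso path even when p exceeds n.

    ## Illustrative only: simulated data, names chosen for this sketch.
    set.seed(1)
    n <- 100; p <- 10
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)
    dat <- data.frame(y = y, x)

    ## Stepwise search minimising AIC (k = 2) or BIC (k = log(n)).
    full <- lm(y ~ ., data = dat)
    fit_aic <- step(full, direction = "both", k = 2, trace = 0)
    fit_bic <- step(full, direction = "both", k = log(n), trace = 0)

    ## The Lasso remains usable when p > n (the 'large p, small n' setting).
    library(glmnet)
    x_wide <- matrix(rnorm(50 * 200), 50, 200)  # n = 50, p = 200
    y_wide <- x_wide[, 1] + rnorm(50)
    lasso_path <- glmnet(x_wide, y_wide)        # coefficients along the whole penalty path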

Objectives

This short course focuses on model selection techniques for linear and generalised linear regression in two scenarios: when an exhaustive search of the model space is feasible, and when the dimension is so large that stepwise algorithms or regularisation techniques must be employed to identify good models. We incorporate recent research on graphical tools for model choice and on how to tune regularisation procedures, such as the Lasso, through resampling or model selection criteria.
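A short sketch of both tuning routes, assuming the simulated x_wide and y_wide from the sketch above: cv.glmnet tunes the Lasso penalty by cross-validation, while a BIC-type criterion evaluated along the Lasso path gives a model selection criterion alternative (Gaussian case, up to additive constants).

    ## Assumes x_wide, y_wide from the sketch above.
    library(glmnet)
    set.seed(2)

    ## Route 1: tune the penalty by 10-fold cross-validation.
    cvfit <- cv.glmnet(x_wide, y_wide, nfolds = 10)
    cvfit$lambda.min                    # penalty minimising CV error
    cvfit$lambda.1se                    # sparser model within one SE of the minimum
    coef(cvfit, s = "lambda.1se")
    plot(cvfit)                         # CV curve: a simple graphical aid to model choice

    ## Route 2: tune by a BIC-type criterion along the Lasso path.
    fit <- glmnet(x_wide, y_wide)
    rss <- colSums((y_wide - predict(fit, x_wide))^2)
    bic <- nrow(x_wide) * log(rss / nrow(x_wide)) + log(nrow(x_wide)) * fit$df
    fit$lambda[which.min(bic)]          # penalty selected by BIC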

Practical implementation of the methods discussed is an essential component of this course. Interactive labs will give participants the opportunity to apply what they have learnt. We will use the cross-platform, open-source software R, in particular the leaps, bestglm, glmnet and mplot packages.
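As a rough preview of that toolchain (assuming the simulated objects dat, x_wide and y_wide from the earlier sketches), the code below runs an exhaustive subset search with leaps and a hand-rolled bootstrap selection-frequency loop with glmnet. The loop only illustrates the kind of stability information that mplot’s graphical tools automate; it is not the mplot implementation itself.

    ## Assumes dat, x_wide, y_wide from the earlier sketches.
    ## install.packages(c("leaps", "bestglm", "glmnet", "mplot"))  # once, if needed
    library(leaps)
    library(glmnet)

    ## Exhaustive all-subsets search (feasible for moderate p).
    rs <- summary(regsubsets(y ~ ., data = dat, nvmax = 10))
    rs$which[which.min(rs$bic), ]       # best subset under BIC

    ## Hand-rolled bootstrap selection frequencies: the raw ingredient of a
    ## stability path; mplot automates and visualises this kind of output.
    set.seed(3)
    B <- 50
    sel <- matrix(0, B, ncol(x_wide))
    for (b in 1:B) {
      idx <- sample(nrow(x_wide), replace = TRUE)
      cv <- cv.glmnet(x_wide[idx, ], y_wide[idx])
      beta <- as.matrix(coef(cv, s = "lambda.1se"))[-1, 1]  # drop intercept
      sel[b, ] <- beta != 0
    }
    colMeans(sel)[1:5]                  # inclusion frequency of the first five predictors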