Fast algorithms and modern visualisations for feature selection

Please note that this website will be under active development until Monday 20 July; there will be additions and minor changes to the material.

Welcome

Instructors

Samuel Mueller is a Professor of Statistics and Interim Head of School at the University of Sydney and a Fellow of the American Statistical Association. He has 18 years of experience as a mathematical statistician and is renowned for his contributions to model selection, classification and prediction for statistically challenging data. He currently leads two research groups: one on Theoretical Statistical Model Selection at the ANU (with Prof Welsh) and one on Fast and Interactive Methods for Complex High-Dimensional Data at USyd. He was appointed by the Australian Research Council to its College of Experts for 2019-2021 and is an Editor (Theory & Methods) of the Australian and New Zealand Journal of Statistics. He has also held various offices in the IBS and is currently serving a four-year term as a Council Member.

Garth Tarr is a Senior Lecturer and Associate Head (Education) in the School of Mathematics and Statistics at the University of Sydney. He has received more than A$4M in competitive grant funding and a number of citations for his teaching, including a Vice-Chancellor’s Award for Teaching Excellence in 2016. He received his PhD in Mathematical Statistics from the University of Sydney and has held positions at the University of Newcastle and the Australian National University. His diverse interests include robust statistics, data visualisation, model selection, educational research, meat science and biostatistics. Garth is an expert R user and has created several R packages, including the mplot package. He is an Associate Editor for the Australian and New Zealand Journal of Statistics and the Biometric Bulletin’s Software Corner.

Background

This short course focuses on model selection techniques for regression models in two scenarios: when an exhaustive search of the model space is possible, and when the dimension is large and either stepwise algorithms or regularisation techniques have to be employed to identify good models. We incorporate recent research on graphical tools for model choice and touch on how to tune regularisation procedures, such as the Lasso, through resampling or model selection criteria. Importantly, the limitations of the various model selection procedures will be discussed. A key component of the course is assessing the stability of selected components, which is paramount for reliable final predictive models. We show how this can be achieved by visualising measures of stability.
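The two scenarios and the idea of selection stability can be illustrated with a minimal sketch. It uses only base R so that it runs without extra packages; it is not the course's own code, and the packages used in the course offer far more efficient and complete tools. The simulated data, the variable names and the helper functions `bic_of` and `select_vars` are assumptions made purely for illustration.

```r
# Simulated data (an assumption for illustration): y depends on x1 and x2 only.
set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)
vars <- setdiff(names(dat), "y")

# Scenario 1: exhaustive search. Enumerate all 2^p subsets and compare BIC.
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))
bic_of <- function(keep, d) {
  rhs <- if (any(keep)) paste(vars[keep], collapse = " + ") else "1"
  BIC(lm(as.formula(paste("y ~", rhs)), data = d))
}
bics <- apply(subsets, 1, bic_of, d = dat)
best_exhaustive <- vars[unlist(subsets[which.min(bics), ])]

# Scenario 2: backward stepwise search with a BIC-type penalty, k = log(n).
select_vars <- function(d) {
  full <- lm(y ~ ., data = d)
  fit <- step(full, direction = "backward", k = log(nrow(d)), trace = 0)
  attr(terms(fit), "term.labels")
}
best_stepwise <- select_vars(dat)

# Stability: repeat the stepwise selection on bootstrap resamples and
# record how often each variable is retained.
B <- 50
counts <- setNames(numeric(length(vars)), vars)
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)
  keep <- select_vars(dat[idx, ])
  counts[keep] <- counts[keep] + 1
}
inclusion_freq <- counts / B

print(best_exhaustive)
print(best_stepwise)
print(round(inclusion_freq, 2))
```

Variables with an inclusion frequency close to 1 are selected consistently across resamples, while those with low or intermediate frequencies are less stable. Plotting such frequencies is one simple way to visualise stability.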

Objectives

The practical implementation of the discussed methods is an essential component of this course. Interactive labs will give participants the opportunity to apply what they have learnt; some of the lab material can be completed after the course to consolidate understanding. We will use the cross-platform, open-source software R; in particular, we will make use of the lmSubsets, bestglm, glmnet and mplot packages.

Assumed knowledge

It will be assumed that participants are familiar with R and standard regression modelling techniques.

Schedule

Links to the resources will be made available on Monday 20 July. Participants are expected to work through the material, watching the recordings and attempting the lab questions, at a time that suits them over a three-day period. We will run two Zoom drop-in sessions on Thursday 23 July, 10-11 AM and 4-5 PM (Sydney time, AEST). This corresponds to:

9-10 AM and 3-4 PM Thursday 23 July in Seoul
1-2 AM and 7-8 AM Thursday 23 July in London
8-9 PM Wednesday 22 July and 2-3 AM Thursday 23 July in New York

During these Zoom sessions we will answer any questions participants may have. The Zoom links will be emailed directly to registered participants.
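Participants in other time zones can convert the session start times themselves. The following is a minimal sketch in base R; the year 2020 is an assumption (the page does not state the year), and the time zone names are standard tz database identifiers that can be replaced with your own.

```r
# Session start times in Sydney. The year 2020 is an assumption for illustration.
starts <- as.POSIXct(c("2020-07-23 10:00", "2020-07-23 16:00"),
                     tz = "Australia/Sydney")

# Zones to convert to; add or replace entries with your own tz database name.
zones <- c(Seoul = "Asia/Seoul",
           London = "Europe/London",
           `New York` = "America/New_York")

# One column per zone, one row per session.
local_times <- sapply(zones, function(z) {
  format(starts, tz = z, format = "%Y-%m-%d %H:%M")
})
print(local_times)
```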

Resources and recordings

All links will be posted by Monday 20 July.

Component	Resources
Lecture A: Exhaustive and non-exhaustive algorithms without resampling	Slides | PDF | R code
Part A1: Selecting models	Video | Audio
Part A2: Regularisation methods	Video | Audio
Part A3: Marginality principle	Video | Audio
Lab A	Questions | Solutions
Lecture B: Exhaustive and non-exhaustive algorithms with resampling	Slides | PDF | R code
Part B1: Cross-validation for model selection	Video | Audio
Part B2: The mplot package	Video | Audio
Part B3: Subtractive stability measures	Video | Audio
Lab B	Questions | Solutions

Computing requirements

Latest version of R (v4.0.2 or later)
Latest version of RStudio (v1.3 or later) (recommended, not required)
Install the following packages (or ensure you have the latest versions):

install.packages("lmSubsets")
install.packages("bestglm")
install.packages("lars")
install.packages("mplot")
install.packages("MASS")
install.packages("Hmisc")
install.packages("car")
install.packages("mfp")
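To confirm that a machine meets these requirements, a quick check along the following lines may help. This is a minimal sketch in base R: it only reports the R version and any of the listed packages that are not installed, and it does not install anything or verify that installed packages are the latest versions. The variable names are illustrative.

```r
# Packages listed in the computing requirements.
pkgs <- c("lmSubsets", "bestglm", "lars", "mplot",
          "MASS", "Hmisc", "car", "mfp")

# Is the running R version at least 4.0.2?
r_ok <- getRversion() >= "4.0.2"
if (!r_ok) {
  message("R ", as.character(getRversion()),
          " is older than 4.0.2; please update R.")
}

# Which of the listed packages are not yet installed?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```

Any packages reported as missing can then be installed by passing them to `install.packages()`.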