This workshop introduces R as an efficient tool for data manipulation and visualisation. The ultimate goal is to develop competency in using R for data processing and analysis, so that the output can be used to gain insights and inform decision making within industry. We begin with the basics: how to install R and RStudio and manage packages. We then cover importing data, data manipulation and cleaning, visualisation and standard modelling techniques.

An important component of the workshop is for you to perform an analysis on your own data set. We begin work on this in Part 3 and participants continue to work on it as “homework” before presenting their results in the follow-up two days.

More detail can be found in the schedule below, though the program is flexible and the workshop will aim to be as adaptive to the needs of the participants as possible.

Source: Grolemund and Wickham (2016). R for Data Science.


You will need to bring your own laptop computer. If it is a computer provided by your organisation, please ensure you have admin privileges (i.e. you are able to install software).

In the week before the first workshop, please install the following software:

  • Latest version of R (v3.4.1 or later)
  • Latest version of RStudio (select the free personal licence option).
  • Start up RStudio and enter the following commands in the Console:
# the tidyverse package will also install a number of
# other useful packages, e.g. dplyr, ggplot2, tidyr
install.packages("tidyverse")

Part 1: Introduction to R and RStudio

Part 1 is a customised version of this Software Carpentry workshop designed to familiarise you with the fundamentals of the R language, the RStudio integrated development environment and importing data.

Applied example

Part 2: The tidyverse

The tidyverse suite of packages makes many common data manipulation tasks in R easier. Many of the lessons below work with the gapminder data set, which you can download here (gapminder-FiveYearData.csv) and save in your project’s data folder. To load it, use:

library(tidyverse)  # provides read_csv()
gapminder <- read_csv("data/gapminder-FiveYearData.csv")
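
As a taste of what the lessons cover, here is a minimal sketch of a dplyr summary. To keep it self-contained it uses a tiny stand-in tibble whose column names follow the gapminder file (country, continent, year, lifeExp). The values are made up purely for illustration and the object name gap_small is hypothetical. With the real file loaded, swap in gapminder.

```r
library(dplyr)

# Tiny stand-in with gapminder-style column names.
# The values are illustrative only, NOT real gapminder data.
gap_small <- tibble(
  country   = c("A", "A", "B", "B"),
  continent = c("X", "X", "Y", "Y"),
  year      = c(2002, 2007, 2002, 2007),
  lifeExp   = c(60, 62, 70, 74)
)

# Mean life expectancy per continent in the most recent year
latest <- gap_small %>%
  filter(year == max(year)) %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp))

print(latest)
```

The same filter, group_by and summarise steps should carry over unchanged to the full data set.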

Applied examples

  • Revisit the datasauRus data using ggplot2. Repeat some of Part 1’s exercises using ggplot2 and dplyr.
  • Grader data revisited
  • Practice what you’ve seen so far on this anonymised MyMSA data set, mymsa.csv

Part 3: BYO data

The major component of Part 3 is for you to bring in your own data sets and apply many of the concepts we’ve introduced in your own context, with helpers in the room to ensure you don’t get stuck at the first hurdle. The work you do in Part 3 will form the basis of your homework assignments.

Homework: your data project

In the week between the two parts of the workshop, you will need to do an analysis on your own and generate a markdown report and/or presentation. Depending on your skill level, this might be as simple as reading a file into R and summarising it with some descriptive statistics and plots.
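
For the simplest version of the homework, something along the following lines may be enough. This is only a sketch: the built-in mtcars data is written to a temporary csv to stand in for your own file, and the names csv_path and cars are hypothetical.

```r
library(readr)
library(dplyr)
library(ggplot2)

# Stand-in for "your file": write the built-in mtcars data to a temporary csv
csv_path <- tempfile(fileext = ".csv")
write_csv(mtcars, csv_path)

# Read it back in, as you would with your own file
cars <- read_csv(csv_path)

# Descriptive statistics: overall, then by group
summary(cars$mpg)
cars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg), sd_mpg = sd(mpg))

# A first plot
p <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point()
p
```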

Your 5-15 minute presentation should:

  • Be in Word, PowerPoint or an R Markdown document, which can output either a long-form report (html, word or pdf) or a presentation. Your choice here will likely reflect how you need to present your results to your organisation: if your results are expected in a PowerPoint slide deck, then use PowerPoint. If you have the freedom to present your results how you like, pick whichever method you prefer.
  • Outline the question(s) you want to answer, i.e. what is the problem that this data can help answer, and how would answering it benefit your organisation? Try to explain what you hope a solution would look like and the benefits that could come as a result. We’re not expecting you to have solved it in a week, but think forward to what a perfect outcome might look like, and what would be required to achieve that outcome.
  • Tell us what’s involved in getting the data before it reaches you (e.g. do you ask someone to extract it from a database for you, or do you collect the data yourself?). Think about data integrity: has someone edited the data in Excel before it reaches you? Have you edited the data or column names in Excel before importing it into R? Any data fiddling needs to be recorded and reported (for reproducibility)!
  • Discussion about importing and cleaning the data.
    • How did you import the data into R?
    • Were there any issues you identified in the importing process? E.g. did variable names need cleaning (perhaps with the clean_names function from the janitor package), were there any outliers that needed to be dealt with, were there any duplicates?
    • Show some tabulation and/or cross tabulation using the tabyl and crosstab functions from the janitor package.
  • Discussion about how you went about analysing the data, including the code (or at least key packages/functions).
    • Perform some filtering and selecting of rows using the dplyr package.
    • Show some plots using the ggplot2 package. If appropriate, try using a geom_... that we haven’t used before. For inspiration see the R Graph Gallery.
    • For all of the above, try to keep the R code as well as the plot, so we can see how you got your output.
  • Some preliminary results/output. Have you been able to answer the question outlined above? If not, what is the hurdle? Insufficient data? Data quality? Not sure how to get the data in the right format? Additional analysis required?
  • You don’t have to share the raw data, but if your organisation doesn’t mind and if the data is not confidential, it would be nice to use some of these as example activities for the next time the workshop is run.
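
The cleaning, tabulation, filtering and plotting steps in the list above might look something like the sketch below. The data frame raw and its columns are hypothetical, invented only to give clean_names something to tidy. As far as we're aware, recent janitor releases produce two-way tables by passing two variables to tabyl rather than through a separate crosstab function, so the sketch uses tabyl for both.

```r
library(janitor)
library(dplyr)
library(ggplot2)

# Hypothetical messy data: awkward column names and one duplicated row
raw <- data.frame(
  `Animal ID`   = c(1, 2, 2, 3, 4),
  `Breed Type`  = c("Angus", "Hereford", "Hereford", "Angus", "Angus"),
  `Sex`         = c("F", "M", "M", "F", "M"),
  `Weight (kg)` = c(410, 455, 455, 390, 480),
  check.names = FALSE
)

clean <- raw %>%
  clean_names() %>%  # names become animal_id, breed_type, sex, weight_kg
  distinct()         # drop exact duplicate rows

# One-way and two-way tabulation
clean %>% tabyl(breed_type)
clean %>% tabyl(breed_type, sex)

# Filter rows and select columns with dplyr
heavy <- clean %>%
  filter(weight_kg > 400) %>%
  select(animal_id, breed_type, weight_kg)

# A plot, keeping the code alongside so others can see how it was made
ggplot(clean, aes(x = breed_type, y = weight_kg)) +
  geom_boxplot()
```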

If you’re stuck on something as you’re working through this, talk to your colleagues or send Garth an email with an outline of what you’re trying to do, the code you’ve tried and the error message.

Part 4: Interactive data visualisation and analysis

Part 4 material will be based on this workshop.

  1. ggplot2 recap
  2. Homework show-and-tell - show the group what you came up with
  3. Correlations
  4. Adjusted means
  5. Interactive pivot tables with the rpivotTable package
  6. Strings with stringr overview | Activities
  7. Interactive visualisations presentation | code
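
To give a flavour of a few of these topics, here is a rough sketch using the built-in mtcars data. It is not the workshop material itself: the string vector x is made up, and the pivot table call only runs in an interactive session with the rpivotTable package installed.

```r
library(stringr)

# Correlation matrix for a few numeric columns
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
round(cor_mat, 2)

# A single correlation with a confidence interval and p-value
cor.test(mtcars$mpg, mtcars$wt)

# stringr: tidy up inconsistent free-text values (made-up example strings)
x <- c("  Angus ", "angus", "HEREFORD")
x_clean <- str_to_title(str_trim(x))
x_clean
str_detect(x_clean, "Angus")

# Interactive pivot table, shown in the RStudio Viewer
if (interactive() && requireNamespace("rpivotTable", quietly = TRUE)) {
  rpivotTable::rpivotTable(mtcars, rows = "cyl", cols = "gear")
}
```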

Part 5: Consolidation

  1. Modelling with log transformed variables presentation | worksheet | data | data info
  2. Shiny cheat sheet
  3. Revision and building a Shiny web interface
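
As a rough illustration of modelling with log transformed variables, the sketch below fits a log-log regression to the built-in mtcars data. The choice of data and variables is ours, not taken from the worksheet.

```r
# Log-log regression: both response and predictor are log transformed
fit <- lm(log(mpg) ~ log(wt), data = mtcars)
summary(fit)

# In a log-log model the slope is roughly the % change in mpg
# associated with a 1% increase in weight
coef(fit)["log(wt)"]

# Predictions are on the log scale; exponentiate to get back to mpg.
# Note this back-transformed value is not the predicted mean of mpg.
new_car <- data.frame(wt = 3)
exp(predict(fit, newdata = new_car))
```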

What are the core tasks you want to be able to do when you go back to the office? These will be explored in Part 1 and revisited throughout the workshop, and there will be dedicated time in Part 5 to make sure we meet these objectives.

  • Workflow discussion (how to institutionalise processes)
  • Outputting data (saving data files for sharing: csv and Rdata files)
  1. Saving plots and data
  2. Responsible coding

Advanced topics

Writing functions

Often you’ll want to do the same thing over and over again with different data sets. Writing your own function is a convenient way to make sure that you do the same thing every time.

  1. Introduction to functions
  2. Control flow
  3. Vectorisation
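
The three topics above can be seen together in a small sketch. The function cv (coefficient of variation) is a hypothetical example of ours, applied to the built-in mtcars data.

```r
# A small function with basic input checking
cv <- function(x, na.rm = TRUE) {
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(mtcars$mpg)

# Control flow: apply the function to every column with a for loop
result <- numeric(0)
for (col in names(mtcars)) {
  result[col] <- cv(mtcars[[col]])
}

# The same thing without an explicit loop
result_v <- sapply(mtcars, cv)

# Vectorisation: arithmetic works on a whole column at once.
# 0.4251 is the approximate conversion from US miles per gallon to km per litre.
kpl <- mtcars$mpg * 0.4251
head(kpl)
```

Because the calculation lives in one function, every data set gets exactly the same treatment, and a fix to cv applies everywhere it is used.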

Working with very large data sets

  1. The data.table package
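
For a first look at the data.table syntax, here is a minimal sketch on the built-in mtcars data. The data set is tiny and only stands in for a large one. For a big csv you would typically start from fread() with your own file path.

```r
library(data.table)

# Convert a data frame to a data.table, keeping the row names as a column
dt <- as.data.table(mtcars, keep.rownames = "model")

# General form is dt[i, j, by]: filter rows, compute, group
dt[cyl > 4, .(mean_mpg = mean(mpg), n = .N), by = gear]

# Add a column by reference, without copying the whole table
dt[, kpl := mpg * 0.4251]
head(dt[, .(model, mpg, kpl)])
```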


Please fill in this survey at the conclusion of the workshop.