Data visualisation, interactive data analysis, statistical programming

Garth Tarr

Case study: gene expression in starvation

Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA- what we call being "expressed". (If a gene is never turned into RNA, it may as well not be there at all). This provides a sort of "cellular switchboard" that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell's activity.

Brauer (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich media), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast's supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

"Starving" the yeast of these nutrients lets us find genes that:

Featured R packages

install.packages(c("readr", "tidyr", "dplyr"))

Get the data and tidy it

The data, tidying code and examples are borrowed from here and here.

require(readr)
original_data = read_delim("http://www.maths.usyd.edu.au/u/gartht/Brauer2008_DataSet1.tds", 
    delim = "\t")
## View(original_data)  # opens a spreadsheet view in RStudio
dim(original_data)
## [1] 5537   40

Fix the name column by splitting on ||, remove white space and drop unecessary variables. We also want to ensure that we have tidy data -- each variable should be one column - the column headers are values not variable names.

require(dplyr)
require(tidyr)

nutrient_names <- c(G = "Glucose", L = "Leucine", P = "Phosphate",
                    S = "Sulfate", N = "Ammonia", U = "Uracil")


cleaned_data = original_data %>%
  separate(NAME, 
           c("name", "BP", "MF", "systematic_name", "number"), 
           sep = "\\|\\|") %>%
  mutate_each(funs(trimws), name:systematic_name) %>%
  dplyr::select(-number, -GID, -YORF, -GWEIGHT)  %>%
  gather(sample, expression, G0.05:U0.3) %>%
  separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>%
  mutate(nutrient = plyr::revalue(nutrient, nutrient_names)) %>%
  filter(!is.na(expression), systematic_name != "")
names(cleaned_data)
## [1] "name"            "BP"              "MF"              "systematic_name"
## [5] "nutrient"        "rate"            "expression"

The above code chunk is doing a lot of processing very sucinctly using the pipe operator (see the magrittr package for details). The gather() function melts the data - instead of one row per gene, we now have one row per gene per sample. We've gathered 36 columns together into two variables (expression and sample) then separated sample out into two variables (nutrient and rate). Note that sample never appears in the final output... the wonders of the pipe (%>%) operator.

Visualise the data

Below is a classical approach using ggplot2.

Look at one gene LEU1:

library(ggplot2)
cleaned_data %>%
  filter(name == "LEU1") %>%
  ggplot(aes(rate, expression, color = nutrient)) +
  geom_line()

plot of chunk unnamed-chunk-4

LEU1's expression is far higher (more “turned on”) when the cell is being starved of leucine than in any other condition, because in that case the cell has to synthesize its own leucine. And as the amount of leucine in the environment (the growth rate) increases, the cell can focus less on leucine production, and the expression of those genes go down. We’ve just gotten one snapshot of our gene’s regulatory network, and how it responds to external stimuli.

To look at all the genes in the leucine biosynthesis process subset using the BP variable, to filter for all genes in that process, and then facet to create sub-plots for each.

cleaned_data %>%
  filter(BP == "leucine biosynthesis") %>%
  ggplot(aes(rate, expression, color = nutrient)) +
  geom_line() +
  facet_wrap(~name)

plot of chunk unnamed-chunk-5

LEU1, LEU2, and LEU4 all show a similar pattern, where starvation of leucine causes higher gene expression. LEU4 also appears to respond to glucose starvation as well.

We can do some basic model fitting, adding lines of best fit using geom_smooth(method = "lm", se = FALSE):

cleaned_data %>%
  filter(BP == "leucine biosynthesis") %>%
  ggplot(aes(rate, expression, color = nutrient)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~name)

plot of chunk unnamed-chunk-6

This is great, but there are so many different genes and processes - how can we look at them all? How do you deal with collaborators who are constantly harassing you asking for plots of different genes? Shiny is the answer!

Instructions

install.packages(c("readr", "dplyr", "tidyr", "ggplot2"))
### Data preparation
require(readr)
require(dplyr)
require(tidyr)
require(ggplot2)
original_data = read_delim("http://www.maths.usyd.edu.au/u/gartht/Brauer2008_DataSet1.tds", delim = "\t")
nutrient_names <- c(G = "Glucose", L = "Leucine", P = "Phosphate",
                    S = "Sulfate", N = "Ammonia", U = "Uracil")
cleaned_data = original_data %>%
  separate(NAME, 
           c("name", "BP", "MF", "systematic_name", "number"), 
           sep = "\\|\\|") %>%
  mutate_each(funs(trimws), name:systematic_name) %>%
  dplyr::select(-number, -GID, -YORF, -GWEIGHT)  %>%
  gather(sample, expression, G0.05:U0.3) %>%
  separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>%
  mutate(nutrient = plyr::revalue(nutrient, nutrient_names)) %>%
  filter(!is.na(expression), systematic_name != "")
output$plot1 = renderPlot({
  cleaned_data %>%
    filter(name == "LEU1") %>%
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_line() + theme_bw(base_size = 14) + 
    facet_wrap(~name + systematic_name)
})
plotOutput("plot1")
selectizeInput(inputId = "gene",
               label = "Select gene",
               choices = sort(unique(cleaned_data$name)),
               selected = "LEU1",
               multiple = FALSE)
output$plot1 = renderPlot({
  cleaned_data %>%
    filter(name == input$gene) %>%
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_line() + theme_bw(base_size = 14) + 
    facet_wrap(~name + systematic_name)
})
sidebarPanel(
  selectizeInput(inputId = "gene",
                 label = "Select gene(s)",
                 choices = sort(unique(cleaned_data$name)),
                 selected = "LEU1",
                 multiple = TRUE),
  checkboxInput(inputId = "line",
                label = "Add line of best fit?",
                value = FALSE)
)
output$plot1 = renderPlot({
  if(input$line){
    cleaned_data %>%
      filter(is.element(name, input$gene)) %>%
      ggplot(aes(rate, expression, color = nutrient)) +
      geom_point() + theme_bw(base_size = 14) + 
      geom_smooth(method = "lm", se = FALSE)  + 
      facet_wrap(~name)
  } else {
    cleaned_data %>%
      filter(is.element(name, input$gene)) %>%
      ggplot(aes(rate, expression, color = nutrient)) +
      geom_line() + theme_bw(base_size = 14) + 
      facet_wrap(~name + systematic_name)
  }
})

Suggested solution

See here for suggested solutions - though there's many ways to accomplish the same thing, so yours might not be exactly the same. Try not to cheat, unless you really have to.

References