The janitor package is a R package that has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimised for user-friendliness.

install.packages("janitor")

The main janitor functions:

  • perfectly format data frame column names;
  • isolate partially-duplicate records; and
  • provide quick tabulations (i.e., frequency tables and crosstabs).

Other janitor functions nicely format the results of these tabulations. Together, these tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.

The data set we’ll illustrate this on is some (anonymised) MyMSA data, mymsa.xlsx, save it to your project’s data folder.

Catalog of janitor functions

clean_names()

I call this function every time I read in a new data set.

It works in a %>% pipeline, and handles problematic variable names, especially those that are so well preserved by readxl::read_excel() and readr::read_csv().

  • Returns names with only lowercase letters, with _ as a separator
  • Handles special characters and spaces
  • Appends numbers to duplicated names
  • Converts “%” to “percent” to retain meaning

To begin we import the excel file into an object called mymsa and create a new object, x, that has clean names:

library(janitor)
library(readxl)
mymsa = read_excel("data/mymsa.xlsx")
x = janitor::clean_names(mymsa)

We can compare the column names of the imported data and the cleaned data frame:

data.frame(mymsa = colnames(mymsa), x = colnames(x))
##                mymsa                    x
## 1               RFID                 rfid
## 2              Plant                plant
## 3           KillDate            kill_date
## 4             BodyNo              body_no
## 5   LeftSideScanTime  left_side_scan_time
## 6  RightSideScanTime right_side_scan_time
## 7         HangMethod          hang_method
## 8                Hgp                  hgp
## 9                Sex                  sex
## 10          LeftHscw            left_hscw
## 11         RightHscw           right_hscw
## 12         TotalHscw           total_hscw
## 13             P8Fat                p8fat
## 14               Lot                  lot
## 15          Est % BI       est_percent_bi
## 16          HumpCold            hump_cold
## 17               Ema                  ema
## 18  OssificationCold    ossification_cold
## 19       AusMarbling         aus_marbling
## 20       MsaMarbling         msa_marbling
## 21        MeatColour          meat_colour
## 22         FatColour           fat_colour
## 23        RibfatCold          ribfat_cold
## 24                Ph                   ph
## 25          LoinTemp            loin_temp
## 26          FeedType            feed_type
## 27      NoDaysOnFeed      no_days_on_feed
## 28          MSAIndex            msa_index
## 29             spare                spare

tabyl()

tabyl() takes a vector and returns a frequency table, like table(). But its additional features are:

  • It returns a data.frame - for manipulating further, or printing with knitr::kable().
  • It automatically calculates percentages
  • It can (optionally) display NA values
    • When NA values are present, it will calculate an additional column valid_percent
  • It can (optionally) sort on counts
  • It can be called with %>% in a pipeline
  • When called on a factor, it will include missing levels in the result (levels not present in the vector)
tabyl(x, meat_colour)
##  meat_colour    n percent
##           1B   87 0.02175
##           1C  657 0.16425
##            2 1730 0.43250
##            3 1478 0.36950
##            4   30 0.00750
##            5   14 0.00350
##            6    4 0.00100

Compare to the base R approach:

table(x$meat_colour)
## 
##   1B   1C    2    3    4    5    6 
##   87  657 1730 1478   30   14    4

tabyl() can be called on a piped-in data frame, which allows for fast, flexible exploration of data:

# Load dplyr for the %>% pipe 
library(dplyr)
x %>% tabyl(meat_colour)
##  meat_colour    n percent
##           1B   87 0.02175
##           1C  657 0.16425
##            2 1730 0.43250
##            3 1478 0.36950
##            4   30 0.00750
##            5   14 0.00350
##            6    4 0.00100

We can format the percent column nicely using the adorn_pct_formatting function.

x %>% 
  tabyl(meat_colour) %>% 
  adorn_pct_formatting(digits = 0, affix_sign = TRUE)
##  meat_colour    n percent
##           1B   87      2%
##           1C  657     16%
##            2 1730     43%
##            3 1478     37%
##            4   30      1%
##            5   14      0%
##            6    4      0%

There’s a column called spare, let’s see what’s in it:

x %>% tabyl(spare)
##  spare    n percent valid_percent
##     NA 4000       1            NA

The spare column has only NA values (missing).

remove_empty()

The remove_empty() function removes any columns that are entirely empty and entire rows that are entirely empty.

x = remove_empty(x, which = c("rows","cols"))

When I’m importing data, I usually have the following pipeline:

x = read_excel("data/mymsa.xlsx") %>% 
  clean_names() %>% remove_empty()
## value for "which" not specified, defaulting to c("rows", "cols")

Crosstabulation

The tabyl() function generalises to crosstabulations of two (or more) variables. We can start by looking at the distribution of meat colour over the two plants in the data set.

x %>% tabyl(meat_colour, plant)
##  meat_colour    1   2
##           1B    0  87
##           1C   87 570
##            2 1443 287
##            3 1477   1
##            4   27   3
##            5    9   5
##            6    1   3

We can add row and/or column totals using adorn_totals()

# row totals
x %>% 
  tabyl(meat_colour, plant) %>% 
  adorn_totals(where = "row")
##  meat_colour    1   2
##           1B    0  87
##           1C   87 570
##            2 1443 287
##            3 1477   1
##            4   27   3
##            5    9   5
##            6    1   3
##        Total 3044 956
# column totals
x %>% 
  tabyl(meat_colour, plant) %>% 
  adorn_totals(where = "col")
##  meat_colour    1   2 Total
##           1B    0  87    87
##           1C   87 570   657
##            2 1443 287  1730
##            3 1477   1  1478
##            4   27   3    30
##            5    9   5    14
##            6    1   3     4
# row and column totals
x %>% 
  tabyl(meat_colour, plant) %>% 
  adorn_totals(where = c("row","col"))
##  meat_colour    1   2 Total
##           1B    0  87    87
##           1C   87 570   657
##            2 1443 287  1730
##            3 1477   1  1478
##            4   27   3    30
##            5    9   5    14
##            6    1   3     4
##        Total 3044 956  4000

We can also convert to percentages using adorn_percentages() and format nicely again using adorn_pct_formatting()

x %>% 
  tabyl(meat_colour, plant) %>% 
  adorn_totals(where = c("row","col")) %>% 
  adorn_percentages(denominator = "col") %>% 
  adorn_pct_formatting(digits = 0) 
##  meat_colour    1    2 Total
##           1B   0%   9%    2%
##           1C   3%  60%   16%
##            2  47%  30%   43%
##            3  49%   0%   37%
##            4   1%   0%    1%
##            5   0%   1%    0%
##            6   0%   0%    0%
##        Total 100% 100%  100%

And add back the counts using the adorn_ns() function

x %>% 
  tabyl(meat_colour, plant) %>% 
  adorn_totals(where = c("row","col")) %>% 
  adorn_percentages(denominator = "col") %>% 
  adorn_pct_formatting(digits = 0) %>% 
  adorn_ns(position = "front")
##  meat_colour           1          2       Total
##           1B    0   (0%)  87   (9%)   87   (2%)
##           1C   87   (3%) 570  (60%)  657  (16%)
##            2 1443  (47%) 287  (30%) 1730  (43%)
##            3 1477  (49%)   1   (0%) 1478  (37%)
##            4   27   (1%)   3   (0%)   30   (1%)
##            5    9   (0%)   5   (1%)   14   (0%)
##            6    1   (0%)   3   (0%)    4   (0%)
##        Total 3044 (100%) 956 (100%) 4000 (100%)

get_dupes()

This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn’t be any.

For example, in a tidy data frame you might expect to have a unique ID repeated for each year, and year repeated for each unique ID, but no duplicated pairs of unique ID & year. Say you want to check for their presence, and study any such duplicated records.

get_dupes() returns the records (and inserts a count of duplicates) so you can sleuth out the problematic cases.

In our data there’s no duplicate records, but we can check this. The unique identifier is rfid so we can call get_dupes on the rfid column:

x %>% get_dupes(rfid)
## No duplicate combinations found of: rfid
## # A tibble: 0 x 29
## # … with 29 variables: rfid <chr>, dupe_count <int>, plant <dbl>,
## #   kill_date <dttm>, body_no <dbl>, left_side_scan_time <dbl>,
## #   right_side_scan_time <dbl>, hang_method <chr>, hgp <chr>, sex <chr>,
## #   left_hscw <dbl>, right_hscw <dbl>, total_hscw <dbl>, p8fat <dbl>,
## #   lot <dbl>, est_percent_bi <chr>, hump_cold <dbl>, ema <dbl>,
## #   ossification_cold <dbl>, aus_marbling <dbl>, msa_marbling <dbl>,
## #   meat_colour <chr>, fat_colour <dbl>, ribfat_cold <dbl>, ph <dbl>,
## #   loin_temp <dbl>, feed_type <chr>, no_days_on_feed <dbl>,
## #   msa_index <dbl>

I’m going to artificially create duplicates so we can see what happens when there are duplicates. The code below adds 3 duplicate rows to the end.

x1 = x %>% slice(1:3)
x2 = bind_rows(x1,x)
x2 %>% get_dupes(rfid)
## # A tibble: 6 x 29
##   rfid  dupe_count plant kill_date           body_no left_side_scan_…
##   <chr>      <int> <dbl> <dttm>                <dbl>            <dbl>
## 1 201 …          2     1 2016-08-15 00:00:00     193              423
## 2 201 …          2     1 2016-08-15 00:00:00     193              423
## 3 253 …          2     1 2016-08-15 00:00:00     257              542
## 4 253 …          2     1 2016-08-15 00:00:00     257              542
## 5 818 …          2     1 2016-08-02 00:00:00      99              445
## 6 818 …          2     1 2016-08-02 00:00:00      99              445
## # … with 23 more variables: right_side_scan_time <dbl>, hang_method <chr>,
## #   hgp <chr>, sex <chr>, left_hscw <dbl>, right_hscw <dbl>,
## #   total_hscw <dbl>, p8fat <dbl>, lot <dbl>, est_percent_bi <chr>,
## #   hump_cold <dbl>, ema <dbl>, ossification_cold <dbl>,
## #   aus_marbling <dbl>, msa_marbling <dbl>, meat_colour <chr>,
## #   fat_colour <dbl>, ribfat_cold <dbl>, ph <dbl>, loin_temp <dbl>,
## #   feed_type <chr>, no_days_on_feed <dbl>, msa_index <dbl>

excel_numeric_to_date()

Ever load data from Excel and see a value like 42223 where a date should be? This function converts those serial numbers to class Date.

excel_numeric_to_date(41103)
## [1] "2012-07-13"

@garthtarr