Abstract

There are now two distinct dialects of the R programming language in the wild. The first and original dialect is typically referred to as “base” R, deriving from the base package that comes pre-loaded as part of the standard R installation. The second is known as the “tidyverse” (or affectionately the “Hadleyverse”) and was largely developed by Hadley Wickham, one of R’s most prolific package contributors. The tidyverse is an ‘opinionated’ collection of R packages that duplicate and seek to improve upon numerous base R functions for data manipulation (e.g. dplyr) and graphing (e.g. ggplot2). As the tidyverse has grown increasingly comprehensive, some have suggested that it be taught first to new R users. The debate over which R dialect is better has generated a lot of heat, but not much light. This talk will review the similarities between the two dialects (with numerous examples) and hopefully give new and old R users some perspective.

Comparison

base

  • base R is more closely associated with “Statistics”, with a focus on statistical methods rather than data manipulation methods
  • variable syntax
  • multiple ways to accomplish the same thing
  • vector is probably the primary object
  • base functions may produce multiple outputs based on arguments
  • stable
  • lots of books and examples, but fewer free books

tidyverse

  • tidyverse is more closely associated with “Data Science”, with a focus on data manipulation methods
  • consistent syntax (e.g. the data argument always comes first)
  • fewer ways to accomplish the same thing
  • data.frame is the primary object
  • tidyverse functions produce a single output
  • unstable
  • lots of free books (e.g. R for Data Science) and examples of the tidyverse

Load Packages

suppressPackageStartupMessages({ # silence startup messages; library() emits messages, not warnings
  library(aqp)
  library(soilDB) # install from GitHub
  library(lattice)
  library(tidyverse) # includes: dplyr, tidyr, ggplot2, etc.
  })

Toy Soil Dataset

# soil data for Marion County
s <- get_component_from_SDA(WHERE = "compname IN ('Miami', 'Crosby') AND majcompflag = 'Yes' AND areasymbol != 'US'")
h1 <- get_chorizon_from_SDA(WHERE = "compname IN ('Miami', 'Crosby')")
source("https://raw.githubusercontent.com/ncss-tech/soilReports/master/inst/reports/region11/lab_summary_by_taxonname/genhz_rules/Miami_rules.R")

h <- subset(h1, cokey %in% s$cokey & !grepl("H", hzname))
h$genhz <- generalize.hz(h$hzname, new = ghr$n, pat = ghr$p)
names(h) <- gsub("total", "", names(h))
h2 <- h

h <- merge(h, s[c("cokey", "compname")], by = "cokey", all.x = TRUE)

depths(h2) <- cokey ~ hzdept_r + hzdepb_r
site(h2) <- s

# examine dataset

str(h, 2)
## 'data.frame':    1376 obs. of  32 variables:
##  $ cokey      : int  14880219 14880219 14880219 14880222 14880222 14880222 14880223 14880223 14880223 14880225 ...
##  $ hzname     : chr  "Ap" "BE,Bt1,Bt2" "C" "Ap" ...
##  $ hzdept_r   : int  0 23 79 0 23 79 0 23 79 0 ...
##  $ hzdepb_r   : int  23 79 152 23 79 152 23 79 152 23 ...
##  $ texture    : Factor w/ 21 levels "cos","s","fs",..: 13 17 13 14 17 13 14 17 13 14 ...
##  $ sand_l     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ sand_r     : num  37 21 33 18 21 33 18 21 33 18 ...
##  $ sand_h     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ silt_l     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ silt_r     : num  49 43 47 64 43 47 64 43 47 64 ...
##  $ silt_h     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ clay_l     : num  11 27 15 15 27 15 15 27 15 15 ...
##  $ clay_r     : num  14 36 20 18 36 20 18 36 20 18 ...
##  $ clay_h     : num  25 40 25 25 40 25 25 40 25 25 ...
##  $ om_l       : num  1 0.5 0.5 1 0.5 0.5 1 0.5 0.5 1 ...
##  $ om_r       : num  2 0.75 0.75 2.5 0.75 0.75 2.5 0.75 0.75 2 ...
##  $ om_h       : num  3 1 1 3 1 1 3 1 1 2.5 ...
##  $ dbovendry_r: num  1.52 1.84 1.89 1.52 1.84 1.89 1.52 1.84 1.89 1.52 ...
##  $ ksat_r     : num  9.17 2.33 0.74 9.17 2.33 0.74 9.17 2.33 0.74 9.17 ...
##  $ awc_l      : num  0.18 0.04 0.02 0.18 0.04 0.02 0.18 0.04 0.02 0.18 ...
##  $ awc_r      : num  0.21 0.1 0.03 0.21 0.1 0.03 0.21 0.1 0.03 0.21 ...
##  $ awc_h      : num  0.24 0.16 0.04 0.24 0.16 0.04 0.24 0.16 0.04 0.24 ...
##  $ lep_r      : num  1.5 4.5 1.5 1.5 4.5 1.5 1.5 4.5 1.5 1.5 ...
##  $ sar_r      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ec_r       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cec7_r     : num  15 20 12 15 20 12 15 20 12 15 ...
##  $ sumbases_r : num  11 14 11 11 14 11 11 14 11 11 ...
##  $ ph1to1h2o_l: num  5.6 5.1 7.4 5.6 5.1 7.4 5.6 5.1 7.4 5.6 ...
##  $ ph1to1h2o_r: num  6.5 6.5 7.9 6.5 6.5 7.9 6.5 6.5 7.9 6.5 ...
##  $ ph1to1h2o_h: num  7.3 7.3 8.4 7.3 7.3 8.4 7.3 7.3 8.4 7.3 ...
##  $ genhz      : Factor w/ 8 levels "A","Ap","E","Bt",..: 2 4 8 2 4 8 2 4 8 2 ...
##  $ compname   : chr  "Crosby" "Crosby" "Crosby" "Crosby" ...
# plot dataset

plot(h2[1:10], label = "compname", name = "genhz", color = "clay_r")

Options

In many cases the tidyverse has different defaults for similarly named functions.

strings vs factors

fp <- "C:/workspace2/test.csv"
write.csv(s, file = fp, row.names = FALSE)

s1 <- read.csv(file = fp)
str(s1$drainagecl)
##  Factor w/ 3 levels "Moderately well drained",..: 3 1 1 1 1 1 1 1 1 1 ...
# base option 1
s_b <- read.csv(file = fp, stringsAsFactors = FALSE)
str(s_b$drainagecl)
##  chr [1:439] "Well drained" "Moderately well drained" ...
# base option 2
options(stringsAsFactors = FALSE)
s_b <- read.csv(file = fp)
str(s_b$drainagecl)
##  chr [1:439] "Well drained" "Moderately well drained" ...
# tidyverse -readr
s_t <- read_csv(file = fp) # notice the output is a tibble, not a plain data.frame
str(s_t$drainagecl)
##  chr [1:439] "Well drained" "Moderately well drained" ...
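As an aside, the same default governs data.frame() construction, and since R 4.0 the factory default for both read.csv() and data.frame() is stringsAsFactors = FALSE, so option 2 above is only needed on older R versions. A minimal, self-contained sketch:

```r
# stringsAsFactors = FALSE keeps character columns as-is
df_chr <- data.frame(drainagecl = c("Well drained", "Poorly drained"),
                     stringsAsFactors = FALSE)
class(df_chr$drainagecl)  # "character"

# stringsAsFactors = TRUE forces conversion to factor
df_fct <- data.frame(drainagecl = c("Well drained", "Poorly drained"),
                     stringsAsFactors = TRUE)
class(df_fct$drainagecl)  # "factor"
```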

Printing

# base
head(s_b) # or
##   nationalmusym compname comppct_r compkind majcompflag localphase slope_r
## 1          5vwy    Miami        95   Series         Yes       <NA>       8
## 2         1qg3m    Miami        46   Series         Yes       <NA>      24
## 3         1nyrq    Miami        90   Series         Yes       <NA>      14
## 4          nw15    Miami        46   Series         Yes       <NA>      14
## 5          nw16    Miami        46   Series         Yes       <NA>      24
## 6          nvfl    Miami        90   Series         Yes       <NA>       3
##   tfact wei weg              drainagecl elev_r aspectrep map_r airtempa_r
## 1     4  56   5            Well drained    200       190  1060         12
## 2     3  56   5 Moderately well drained    230       110   890         11
## 3     3  48   6 Moderately well drained    230       200   889         11
## 4     3  48   6 Moderately well drained    230       200   889         11
## 5     3  56   5 Moderately well drained    230       110   890         11
## 6     3  48   6 Moderately well drained    250       200   875         11
##   reannualprecip_r ffd_r nirrcapcl nirrcapscl irrcapcl irrcapscl frostact
## 1               NA   183         4          e       NA        NA Moderate
## 2               NA   170         6          e       NA        NA Moderate
## 3               NA   170         4          e       NA        NA Moderate
## 4               NA   170         4          e       NA        NA Moderate
## 5               NA   170         6          e       NA        NA Moderate
## 6               NA   170         2          e       NA        NA Moderate
##   hydgrp   corcon corsteel
## 1      C      Low      Low
## 2      C Moderate     High
## 3      C Moderate     High
## 4      C Moderate     High
## 5      C Moderate     High
## 6      C Moderate     High
##                                              taxclname taxorder
## 1        OXYAQUIC HAPLUDALFS, FINE-LOAMY, MIXED, MESIC Alfisols
## 2 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 3 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 4 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 5 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 6 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
##   taxsuborder taxgrtgroup           taxsubgrp taxpartsize taxpartsizemod
## 1      Udalfs  Hapludalfs Oxyaquic Hapludalfs  fine-loamy           <NA>
## 2      Udalfs  Hapludalfs Oxyaquic Hapludalfs  fine-loamy           <NA>
## 3      Udalfs  Hapludalfs Oxyaquic Hapludalfs  fine-loamy           <NA>
## 4      Udalfs  Hapludalfs Oxyaquic Hapludalfs  fine-loamy           <NA>
## 5      Udalfs  Hapludalfs Oxyaquic Hapludalfs  fine-loamy           <NA>
## 6      Udalfs  Hapludalfs Oxyaquic Hapludalfs  fine-loamy           <NA>
##   taxceactcl taxreaction taxtempcl taxmoistscl taxtempregime
## 1       <NA>    not used     mesic        <NA>         mesic
## 2     active    not used     mesic        <NA>         mesic
## 3     active    not used     mesic        <NA>         mesic
## 4     active    not used     mesic        <NA>         mesic
## 5     active    not used     mesic        <NA>         mesic
## 6     active    not used     mesic        <NA>         mesic
##   soiltaxedition    cokey
## 1  fifth edition 14758492
## 2  ninth edition 14758736
## 3  ninth edition 14758738
## 4 eighth edition 14767264
## 5  ninth edition 14767267
## 6  ninth edition 14767268
# print(s_b) # prints the whole table

# tidyverse
head(s_t) # or 
## # A tibble: 6 x 39
##   nationalmusym compname comppct_r compkind majcompflag localphase slope_r
##           <chr>    <chr>     <int>    <chr>       <chr>      <chr>   <dbl>
## 1          5vwy    Miami        95   Series         Yes       <NA>       8
## 2         1qg3m    Miami        46   Series         Yes       <NA>      24
## 3         1nyrq    Miami        90   Series         Yes       <NA>      14
## 4          nw15    Miami        46   Series         Yes       <NA>      14
## 5          nw16    Miami        46   Series         Yes       <NA>      24
## 6          nvfl    Miami        90   Series         Yes       <NA>       3
## # ... with 32 more variables: tfact <int>, wei <int>, weg <int>,
## #   drainagecl <chr>, elev_r <dbl>, aspectrep <int>, map_r <int>,
## #   airtempa_r <dbl>, reannualprecip_r <chr>, ffd_r <int>,
## #   nirrcapcl <int>, nirrcapscl <chr>, irrcapcl <chr>, irrcapscl <chr>,
## #   frostact <chr>, hydgrp <chr>, corcon <chr>, corsteel <chr>,
## #   taxclname <chr>, taxorder <chr>, taxsuborder <chr>, taxgrtgroup <chr>,
## #   taxsubgrp <chr>, taxpartsize <chr>, taxpartsizemod <chr>,
## #   taxceactcl <chr>, taxreaction <chr>, taxtempcl <chr>,
## #   taxmoistscl <chr>, taxtempregime <chr>, soiltaxedition <chr>,
## #   cokey <int>
# print(s_t) # prints the first 10 rows

Standard Evaluation with base R

## square brackets using column names
summary(h[, "clay_r"], na.rm = TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   17.00   23.00   22.81   30.00   40.00
# square brackets using logical indices
idx <- names(h) %in% "clay_r"
summary(h[, idx], na.rm = TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   17.00   23.00   22.81   30.00   40.00
# square brackets using column indices
which(idx)
## [1] 13
summary(h[, 13], na.rm = TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   17.00   23.00   22.81   30.00   40.00
## $ operator
summary(h$clay_r, na.rm = TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   17.00   23.00   22.81   30.00   40.00

Non-Standard Evaluation (NSE)

Non-standard evaluation (NSE) allows you to access columns within a data.frame without repeatedly specifying the data.frame. This is particularly useful with long data.frame object names (e.g. soil_horizons vs h) and many calls to different columns. The tidyverse implements NSE by default. Base R has a few functions, such as with() and attach(), that facilitate NSE, but few base functions implement it by default. NSE is somewhat contentious because it can have unintended consequences if you have objects and columns with the same name. As such, NSE is generally meant for interactive analysis, not programming.

# base option 1
with(h, { data.frame(
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE), 
  max = max(clay_r, na.rm = TRUE)
  )})
##   min     mean max
## 1   3 22.80996  40
# base option 2
attach(h)
data.frame(
  min = min(clay_r, na.rm = TRUE),
  mean = mean(clay_r, na.rm = TRUE), 
  max = max(clay_r, na.rm = TRUE)
  )
##   min     mean max
## 1   3 22.80996  40
detach(h)

# tidyverse non-standard evaluation (enabled by default) - dplyr
summarize(h,
          min = min(clay_r, na.rm = TRUE), 
          mean = mean(clay_r, na.rm = TRUE), 
          max = max(clay_r, na.rm = TRUE)
          )
##   min     mean max
## 1   3 22.80996  40
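The name-collision hazard mentioned above is easy to reproduce with a built-in dataset; here mtcars stands in for the soil data purely for illustration:

```r
# a global object whose name does NOT match any mtcars column
clay <- 99

# a misspelled (or missing) column silently falls back to the global object
with(mtcars, mean(clay))  # 99 -- no error, just the wrong answer

# when names DO match, the column wins inside with()
cyl <- 0
with(mtcars, mean(cyl))   # 6.1875 -- the mtcars column, not the global
```

This silent fallback is exactly why NSE is discouraged inside functions and packages.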

Subsetting vs Filtering

# base R
sub_b <- subset(h, genhz == "Ap")

dim(sub_b)
## [1] 211  32
# tidyverse - dplyr
sub_t <- filter(h, genhz == "Ap")

dim(sub_t)
## [1] 211  32
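For completeness, the same filter expressed with bracket indexing as well, on a built-in dataset (mtcars stands in for the soil data). One subtle difference worth knowing: when the condition contains NA, subset() and filter() drop those rows, while `[` returns them as all-NA rows.

```r
# base R, NSE: subset()
sub1 <- subset(mtcars, cyl == 4)

# base R, standard evaluation: bracket indexing
sub2 <- mtcars[mtcars$cyl == 4, ]

# tidyverse (requires dplyr):
# sub3 <- filter(mtcars, cyl == 4)

nrow(sub1)  # 11
```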

Ordering vs Arranging

# base
with(h, h[order(cokey, hzdept_r), ])[1:4, 1:4]
##      cokey     hzname hzdept_r hzdepb_r
## 1 14880219         Ap        0       23
## 2 14880219 BE,Bt1,Bt2       23       79
## 3 14880219          C       79      152
## 4 14880222         Ap        0       23
# tidyverse - dplyr
arrange(h, cokey, hzdept_r)[1:4, 1:4]
##      cokey     hzname hzdept_r hzdepb_r
## 1 14880219         Ap        0       23
## 2 14880219 BE,Bt1,Bt2       23       79
## 3 14880219          C       79      152
## 4 14880222         Ap        0       23

Piping

Sometimes referred to as ‘syntactic sugar’, piping is supposed to make code more readable by making it read from left to right, rather than from the inside out. This becomes particularly valuable when three or more functions are combined. It also alleviates the need to overwrite existing objects.

# base
pip_b <- {subset(s, drainagecl == "Well drained") ->.;
  .[order(.$nationalmusym), ]
  }
pip_b[1:4, 1:4]
##     nationalmusym compname comppct_r   compkind
## 330         1j665    Miami        40 Taxadjunct
## 328         1jt03    Miami        40 Taxadjunct
## 17          1ns3v    Miami        95 Taxadjunct
## 324         1qgv5    Miami        60 Taxadjunct
# tidyverse
pip_t <- filter(s, drainagecl == "Well drained") %>% 
  arrange(nationalmusym)
pip_t[1:4, 1:4]
##   nationalmusym compname comppct_r   compkind
## 1         1j665    Miami        40 Taxadjunct
## 2         1jt03    Miami        40 Taxadjunct
## 3         1ns3v    Miami        95 Taxadjunct
## 4         1qgv5    Miami        60 Taxadjunct
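The %>% pipe comes from the magrittr package (re-exported by dplyr). As an editorial aside: since R 4.1, base R has a native |> pipe, so the `->.;` workaround above is no longer necessary on current R. A sketch on a built-in dataset:

```r
# magrittr pipe, as used by the tidyverse (requires dplyr or magrittr):
# mtcars %>% subset(cyl == 4) %>% nrow()

# native base R pipe, available since R 4.1
n4 <- mtcars |> subset(cyl == 4) |> nrow()
n4  # 11
```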

Split-Apply-Combine

In many cases we want to summarize the variation within groups.

# base
vars <- c("compname", "genhz")
sca_b <- {
  split(h, h[vars], drop = TRUE) ->.;                 # split
  lapply(., function(x) data.frame(                   # apply 
    x[vars][1, ],
    clay_min  = round(min(x$clay_r, na.rm =TRUE)),
    clay_mean = round(mean(x$clay_r, na.rm = TRUE)),
    clay_max  = round(max(x$clay_r, na.rm = TRUE))
    )) ->.;
  do.call("rbind", .)                                  # combine
  }
print(sca_b)
##                 compname    genhz clay_min clay_mean clay_max
## Crosby.A          Crosby        A       18        18       18
## Miami.A            Miami        A       14        20       30
## Crosby.Ap         Crosby       Ap       14        18       18
## Miami.Ap           Miami       Ap        6        18       31
## Crosby.E          Crosby        E       17        17       20
## Miami.E            Miami        E       15        19       30
## Crosby.Bt         Crosby       Bt       24        27       36
## Miami.Bt           Miami       Bt       24        30       35
## Crosby.2Bt        Crosby      2Bt       31        37       40
## Miami.2Bt          Miami      2Bt       23        30       31
## Crosby.2BCt       Crosby     2BCt       23        25       36
## Miami.2BCt         Miami     2BCt       10        23       28
## Crosby.2Cd        Crosby      2Cd       15        17       20
## Miami.2Cd          Miami      2Cd        3        16       23
## Crosby.not-used   Crosby not-used       15        26       36
## Miami.not-used     Miami not-used        3        21       31
# tidyverse - dplyr
sca_t <- group_by(h, compname, genhz) %>%              # split (sort of)
  summarize(                                           # apply and combine
    clay_min  = round(min(clay_r, na.rm =TRUE)),
    clay_mean = round(mean(clay_r, na.rm = TRUE)),
    clay_max  = round(max(clay_r, na.rm = TRUE))
  )
print(sca_t)
## # A tibble: 16 x 5
## # Groups:   compname [?]
##    compname    genhz clay_min clay_mean clay_max
##       <chr>   <fctr>    <dbl>     <dbl>    <dbl>
##  1   Crosby        A       18        18       18
##  2   Crosby       Ap       14        18       18
##  3   Crosby        E       17        17       20
##  4   Crosby       Bt       24        27       36
##  5   Crosby      2Bt       31        37       40
##  6   Crosby     2BCt       23        25       36
##  7   Crosby      2Cd       15        17       20
##  8   Crosby not-used       15        26       36
##  9    Miami        A       14        20       30
## 10    Miami       Ap        6        18       31
## 11    Miami        E       15        19       30
## 12    Miami       Bt       24        30       35
## 13    Miami      2Bt       23        30       31
## 14    Miami     2BCt       10        23       28
## 15    Miami      2Cd        3        16       23
## 16    Miami not-used        3        21       31

Reshaping

In many instances, particularly for graphing, it’s necessary to convert a data.frame from wide to long format.

# base wide to long
vars <- c("clay_r", "sand_r", "om_r")
idvars <- c("compname", "genhz")
head(h[c(idvars, vars)])
##   compname    genhz clay_r sand_r om_r
## 1   Crosby       Ap     14     37 2.00
## 2   Crosby       Bt     36     21 0.75
## 3   Crosby not-used     20     33 0.75
## 4   Crosby       Ap     18     18 2.50
## 5   Crosby       Bt     36     21 0.75
## 6   Crosby not-used     20     33 0.75
lo_b <- reshape(h[c("compname", "genhz", vars)],      # need to exclude unused columns
                  direction = "long",
                  timevar = "variable", times = vars, # capture names of variables in variable column
                  v.names = "value", varying = vars   # capture values of variables in value column
                  )
head(lo_b) # notice the row.names
##          compname    genhz variable value id
## 1.clay_r   Crosby       Ap   clay_r    14  1
## 2.clay_r   Crosby       Bt   clay_r    36  2
## 3.clay_r   Crosby not-used   clay_r    20  3
## 4.clay_r   Crosby       Ap   clay_r    18  4
## 5.clay_r   Crosby       Bt   clay_r    36  5
## 6.clay_r   Crosby not-used   clay_r    20  6
# tidyverse wide to long
idx <- which(names(h) %in% vars)
lo_t <- select(h, compname, idx) %>%                   # need to exclude unused columns
  gather(key = variable, 
         value = value,
         - compname
         )

head(lo_t)
##   compname variable value
## 1   Crosby   sand_r    37
## 2   Crosby   sand_r    21
## 3   Crosby   sand_r    33
## 4   Crosby   sand_r    18
## 5   Crosby   sand_r    21
## 6   Crosby   sand_r    33
# sort factors
comp_sort <- aggregate(value ~ compname, data = lo_b[lo_b$variable == "clay_r", ], median, na.rm = TRUE)
comp_sort <- comp_sort[order(comp_sort$value), ]
lo_b <- within(lo_b, {
  compname = factor(lo_b$compname, levels = comp_sort$compname)
  genhz = factor(genhz, levels = rev(levels(genhz)))
  })


# lattice box plot
bwplot(genhz ~ value | variable + compname, 
       data = lo_b,
       scales = list(x = "free")
       )

# ggplot2 box plot
ggplot(lo_b, aes(x = genhz, y = value)) + # ggplot2 wants the grouping variable on the x-axis, hence coord_flip() below
  geom_boxplot() +                        # notice ggplot2 chains layers with "+", not "%>%"
  facet_wrap(~ compname + variable, scales = "free_x") +
  coord_flip()
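A minimal, self-contained version of the wide-to-long step, using a toy data.frame (values loosely modeled on the soil data) so the two approaches can be compared directly:

```r
# a toy wide data.frame
w <- data.frame(compname = c("Crosby", "Miami"),
                clay_r = c(14, 18),
                sand_r = c(37, 33))
vars <- c("clay_r", "sand_r")

# base: stats::reshape
lo <- reshape(w, direction = "long",
              timevar = "variable", times = vars,
              v.names = "value", varying = vars)
lo

# tidyverse (requires tidyr):
# gather(w, key = "variable", value = "value", -compname)
# (newer tidyr versions supersede gather() with pivot_longer())
```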

Conclusion

The tidyverse and its precursors, plyr and reshape2, introduced me to a lot of cool new ways of manipulating data, and made me question how I would do the same things in base R.

base

  • can be tidyish
  • more abstract syntax (e.g. “[”)
  • fast
  • ‘very’ flexible, to the point of being confusing
  • awkward defaults (e.g. column and row naming, default sorting)

tidyverse

  • more verbose syntax for some things, less verbose for others
  • faster (usually)
  • ‘very’ opinionated, to the point of being annoying
  • clean defaults

Questions

  • is the tidyverse a better syntax for new R users?