There are now 2 distinct dialects of the R programming language in the wild. The first and original dialect is typically referred to as “base” R, which derives from the base package that comes pre-loaded as part of the standard R installation. The second, known as the “tidyverse” (or affectionately the “Hadleyverse”), was largely developed by Hadley Wickham, one of R’s most prolific package contributors. The tidyverse is an ‘opinionated’ collection of R packages that duplicate and seek to improve upon numerous base R functions for data manipulation (e.g. dplyr) and graphing (e.g. ggplot2). As the tidyverse has grown increasingly comprehensive, it has been suggested that it be taught first to new R users. The debate over which R dialect is better has generated a lot of heat, but not much light. This talk will review the similarities (with numerous examples) between the 2 dialects and hopefully give new and old R users some perspective.
suppressWarnings( {
library(aqp)
library(soilDB) # install from github
library(lattice)
library(tidyverse) # includes: dplyr, tidyr, ggplot2, etc...
})
# soil data for Marion County
s <- get_component_from_SDA(WHERE = "compname IN ('Miami', 'Crosby') AND majcompflag = 'Yes' AND areasymbol != 'US'")
h1 <- get_chorizon_from_SDA(WHERE = "compname IN ('Miami', 'Crosby')")
source("https://raw.githubusercontent.com/ncss-tech/soilReports/master/inst/reports/region11/lab_summary_by_taxonname/genhz_rules/Miami_rules.R")
h <- subset(h1, cokey %in% s$cokey & !grepl("H", hzname))
h$genhz <- generalize.hz(h$hzname, new = ghr$n, pat = ghr$p)
names(h) <- gsub("total", "", names(h))
h2 <- h
h <- merge(h, s[c("cokey", "compname")], by = "cokey", all.x = TRUE)
depths(h2) <- cokey ~ hzdept_r + hzdepb_r
site(h2) <- s
# examine dataset
str(h, 2)
## 'data.frame': 1376 obs. of 32 variables:
## $ cokey : int 14880219 14880219 14880219 14880222 14880222 14880222 14880223 14880223 14880223 14880225 ...
## $ hzname : chr "Ap" "BE,Bt1,Bt2" "C" "Ap" ...
## $ hzdept_r : int 0 23 79 0 23 79 0 23 79 0 ...
## $ hzdepb_r : int 23 79 152 23 79 152 23 79 152 23 ...
## $ texture : Factor w/ 21 levels "cos","s","fs",..: 13 17 13 14 17 13 14 17 13 14 ...
## $ sand_l : num NA NA NA NA NA NA NA NA NA NA ...
## $ sand_r : num 37 21 33 18 21 33 18 21 33 18 ...
## $ sand_h : num NA NA NA NA NA NA NA NA NA NA ...
## $ silt_l : num NA NA NA NA NA NA NA NA NA NA ...
## $ silt_r : num 49 43 47 64 43 47 64 43 47 64 ...
## $ silt_h : num NA NA NA NA NA NA NA NA NA NA ...
## $ clay_l : num 11 27 15 15 27 15 15 27 15 15 ...
## $ clay_r : num 14 36 20 18 36 20 18 36 20 18 ...
## $ clay_h : num 25 40 25 25 40 25 25 40 25 25 ...
## $ om_l : num 1 0.5 0.5 1 0.5 0.5 1 0.5 0.5 1 ...
## $ om_r : num 2 0.75 0.75 2.5 0.75 0.75 2.5 0.75 0.75 2 ...
## $ om_h : num 3 1 1 3 1 1 3 1 1 2.5 ...
## $ dbovendry_r: num 1.52 1.84 1.89 1.52 1.84 1.89 1.52 1.84 1.89 1.52 ...
## $ ksat_r : num 9.17 2.33 0.74 9.17 2.33 0.74 9.17 2.33 0.74 9.17 ...
## $ awc_l : num 0.18 0.04 0.02 0.18 0.04 0.02 0.18 0.04 0.02 0.18 ...
## $ awc_r : num 0.21 0.1 0.03 0.21 0.1 0.03 0.21 0.1 0.03 0.21 ...
## $ awc_h : num 0.24 0.16 0.04 0.24 0.16 0.04 0.24 0.16 0.04 0.24 ...
## $ lep_r : num 1.5 4.5 1.5 1.5 4.5 1.5 1.5 4.5 1.5 1.5 ...
## $ sar_r : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ec_r : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cec7_r : num 15 20 12 15 20 12 15 20 12 15 ...
## $ sumbases_r : num 11 14 11 11 14 11 11 14 11 11 ...
## $ ph1to1h2o_l: num 5.6 5.1 7.4 5.6 5.1 7.4 5.6 5.1 7.4 5.6 ...
## $ ph1to1h2o_r: num 6.5 6.5 7.9 6.5 6.5 7.9 6.5 6.5 7.9 6.5 ...
## $ ph1to1h2o_h: num 7.3 7.3 8.4 7.3 7.3 8.4 7.3 7.3 8.4 7.3 ...
## $ genhz : Factor w/ 8 levels "A","Ap","E","Bt",..: 2 4 8 2 4 8 2 4 8 2 ...
## $ compname : chr "Crosby" "Crosby" "Crosby" "Crosby" ...
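The tidyverse analogue to str() is glimpse() (from the tibble package, re-exported by dplyr); a minimal sketch on the same data.frame:
# tidyverse - compact overview of column names, types, and first values
glimpse(h)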
# plot dataset
plot(h2[1:10], label = "compname", name = "genhz", color = "clay_r")
In many cases the tidyverse uses different defaults than similarly named base functions. For example, read.csv() converted strings to factors by default (prior to R 4.0), while readr's read_csv() never does.
fp <- "C:/workspace2/test.csv"
write.csv(s, file = fp, row.names = FALSE)
s1 <- read.csv(file = fp)
str(s1$drainagecl)
## Factor w/ 3 levels "Moderately well drained",..: 3 1 1 1 1 1 1 1 1 1 ...
# base option 1
s_b <- read.csv(file = fp, stringsAsFactors = FALSE)
str(s_b$drainagecl)
## chr [1:439] "Well drained" "Moderately well drained" ...
# base option 2
options(stringsAsFactors = FALSE)
s_b <- read.csv(file = fp)
str(s_b$drainagecl)
## chr [1:439] "Well drained" "Moderately well drained" ...
# tidyverse -readr
s_t <- read_csv(file = fp) # notice the output is a tibble, not a plain data.frame
str(s_t$drainagecl)
## chr [1:439] "Well drained" "Moderately well drained" ...
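If downstream code expects a plain data.frame, the tibble returned by read_csv() can be converted back with as.data.frame(); a small sketch (s_df is a hypothetical new object, so s_t is left unchanged):
# convert the tibble back to a plain data.frame when needed
s_df <- as.data.frame(s_t)
class(s_df)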
# base
head(s_b) # or
## nationalmusym compname comppct_r compkind majcompflag localphase slope_r
## 1 5vwy Miami 95 Series Yes <NA> 8
## 2 1qg3m Miami 46 Series Yes <NA> 24
## 3 1nyrq Miami 90 Series Yes <NA> 14
## 4 nw15 Miami 46 Series Yes <NA> 14
## 5 nw16 Miami 46 Series Yes <NA> 24
## 6 nvfl Miami 90 Series Yes <NA> 3
## tfact wei weg drainagecl elev_r aspectrep map_r airtempa_r
## 1 4 56 5 Well drained 200 190 1060 12
## 2 3 56 5 Moderately well drained 230 110 890 11
## 3 3 48 6 Moderately well drained 230 200 889 11
## 4 3 48 6 Moderately well drained 230 200 889 11
## 5 3 56 5 Moderately well drained 230 110 890 11
## 6 3 48 6 Moderately well drained 250 200 875 11
## reannualprecip_r ffd_r nirrcapcl nirrcapscl irrcapcl irrcapscl frostact
## 1 NA 183 4 e NA NA Moderate
## 2 NA 170 6 e NA NA Moderate
## 3 NA 170 4 e NA NA Moderate
## 4 NA 170 4 e NA NA Moderate
## 5 NA 170 6 e NA NA Moderate
## 6 NA 170 2 e NA NA Moderate
## hydgrp corcon corsteel
## 1 C Low Low
## 2 C Moderate High
## 3 C Moderate High
## 4 C Moderate High
## 5 C Moderate High
## 6 C Moderate High
## taxclname taxorder
## 1 OXYAQUIC HAPLUDALFS, FINE-LOAMY, MIXED, MESIC Alfisols
## 2 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 3 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 4 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 5 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## 6 Fine-loamy, mixed, active, mesic Oxyaquic Hapludalfs Alfisols
## taxsuborder taxgrtgroup taxsubgrp taxpartsize taxpartsizemod
## 1 Udalfs Hapludalfs Oxyaquic Hapludalfs fine-loamy <NA>
## 2 Udalfs Hapludalfs Oxyaquic Hapludalfs fine-loamy <NA>
## 3 Udalfs Hapludalfs Oxyaquic Hapludalfs fine-loamy <NA>
## 4 Udalfs Hapludalfs Oxyaquic Hapludalfs fine-loamy <NA>
## 5 Udalfs Hapludalfs Oxyaquic Hapludalfs fine-loamy <NA>
## 6 Udalfs Hapludalfs Oxyaquic Hapludalfs fine-loamy <NA>
## taxceactcl taxreaction taxtempcl taxmoistscl taxtempregime
## 1 <NA> not used mesic <NA> mesic
## 2 active not used mesic <NA> mesic
## 3 active not used mesic <NA> mesic
## 4 active not used mesic <NA> mesic
## 5 active not used mesic <NA> mesic
## 6 active not used mesic <NA> mesic
## soiltaxedition cokey
## 1 fifth edition 14758492
## 2 ninth edition 14758736
## 3 ninth edition 14758738
## 4 eighth edition 14767264
## 5 ninth edition 14767267
## 6 ninth edition 14767268
# print(s_b) # prints the whole table
# tidyverse
head(s_t) # or
## # A tibble: 6 x 39
## nationalmusym compname comppct_r compkind majcompflag localphase slope_r
## <chr> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 5vwy Miami 95 Series Yes <NA> 8
## 2 1qg3m Miami 46 Series Yes <NA> 24
## 3 1nyrq Miami 90 Series Yes <NA> 14
## 4 nw15 Miami 46 Series Yes <NA> 14
## 5 nw16 Miami 46 Series Yes <NA> 24
## 6 nvfl Miami 90 Series Yes <NA> 3
## # ... with 32 more variables: tfact <int>, wei <int>, weg <int>,
## # drainagecl <chr>, elev_r <dbl>, aspectrep <int>, map_r <int>,
## # airtempa_r <dbl>, reannualprecip_r <chr>, ffd_r <int>,
## # nirrcapcl <int>, nirrcapscl <chr>, irrcapcl <chr>, irrcapscl <chr>,
## # frostact <chr>, hydgrp <chr>, corcon <chr>, corsteel <chr>,
## # taxclname <chr>, taxorder <chr>, taxsuborder <chr>, taxgrtgroup <chr>,
## # taxsubgrp <chr>, taxpartsize <chr>, taxpartsizemod <chr>,
## # taxceactcl <chr>, taxreaction <chr>, taxtempcl <chr>,
## # taxmoistscl <chr>, taxtempregime <chr>, soiltaxedition <chr>,
## # cokey <int>
# print(s_t) # prints the first 10 rows
# square brackets using column names
summary(h[, "clay_r"], na.rm = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 17.00 23.00 22.81 30.00 40.00
# square brackets using logical indices
idx <- names(h) %in% "clay_r"
summary(h[, idx], na.rm = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 17.00 23.00 22.81 30.00 40.00
# square brackets using column indices
which(idx)
## [1] 13
summary(h[, 13], na.rm = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 17.00 23.00 22.81 30.00 40.00
# $ operator
summary(h$clay_r, na.rm = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 17.00 23.00 22.81 30.00 40.00
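For comparison, a hedged dplyr sketch of the same column extraction using pull() and select():
# tidyverse - dplyr
summary(pull(h, clay_r))        # pull() extracts a single column as a vector, like h$clay_r
select(h, clay_r) %>% summary() # select() keeps it as a one-column data.frame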
Non-standard evaluation (NSE) allows you to access columns within a data.frame without repeatedly specifying the data.frame. This is particularly useful with long data.frame object names (e.g. soil_horizons vs h) and many calls to different columns. The tidyverse implements NSE by default. Base R has a few functions, like ‘with()’ and ‘attach()’, that facilitate NSE, but few functions implement it by default. NSE is somewhat contentious because it can have unintended consequences if you have objects and columns with the same name (a small sketch of this pitfall follows the examples below). As such, NSE is generally meant for interactive analysis, not programming.
# base option 1
with(h, { data.frame(
min = min(clay_r, na.rm = TRUE),
mean = mean(clay_r, na.rm = TRUE),
max = max(clay_r, na.rm = TRUE)
)})
## min mean max
## 1 3 22.80996 40
# base option 2
attach(h)
data.frame(
min = min(clay_r, na.rm = TRUE),
mean = mean(clay_r, na.rm = TRUE),
max = max(clay_r, na.rm = TRUE)
)
## min mean max
## 1 3 22.80996 40
detach(h)
# tidyverse non-standard evaluation (enabled by default) - dplyr
summarize(h,
min = min(clay_r, na.rm = TRUE),
mean = mean(clay_r, na.rm = TRUE),
max = max(clay_r, na.rm = TRUE)
)
## min mean max
## 1 3 22.80996 40
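A small sketch of the name-collision pitfall mentioned above; clay_r here is a hypothetical object created in the global environment that happens to share a column name:
# a stray object with the same name as a column
clay_r <- 0

attach(h)
mean(clay_r) # returns 0: the global object masks the attached column
detach(h)

# dplyr's data mask looks in the data.frame first, so the column wins
summarize(h, mean = mean(clay_r, na.rm = TRUE))

rm(clay_r)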
# base R
sub_b <- subset(h, genhz == "Ap")
dim(sub_b)
## [1] 211 32
# tidyverse - dplyr
sub_t <- filter(h, genhz == "Ap")
dim(sub_t)
## [1] 211 32
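Both approaches can also drop unneeded columns in the same step; a minimal sketch, keeping just a few columns (the column choice is arbitrary):
# base - subset() has a select argument
sub_b <- subset(h, genhz == "Ap", select = c(compname, genhz, clay_r))

# tidyverse - dplyr chains filter() and select()
sub_t <- filter(h, genhz == "Ap") %>%
  select(compname, genhz, clay_r)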
# base
with(h, h[order(cokey, hzdept_r), ])[1:4, 1:4]
## cokey hzname hzdept_r hzdepb_r
## 1 14880219 Ap 0 23
## 2 14880219 BE,Bt1,Bt2 23 79
## 3 14880219 C 79 152
## 4 14880222 Ap 0 23
# tidyverse - dplyr
arrange(h, cokey, hzdept_r)[1:4, 1:4]
## cokey hzname hzdept_r hzdepb_r
## 1 14880219 Ap 0 23
## 2 14880219 BE,Bt1,Bt2 23 79
## 3 14880219 C 79 152
## 4 14880222 Ap 0 23
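Sorting in descending order differs slightly between the two dialects; a minimal sketch:
# base - negate the numeric column (or use decreasing = TRUE for a single key)
h[order(h$cokey, -h$hzdept_r), ][1:4, 1:4]

# tidyverse - dplyr
arrange(h, cokey, desc(hzdept_r))[1:4, 1:4]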
Often referred to as ‘syntactic sugar’, piping is supposed to make code more readable by making it read from left to right rather than from the inside out. This becomes particularly valuable when 3 or more functions are combined (see the sketch after the examples below). It also alleviates the need to overwrite existing objects.
# base
pip_b <- {subset(s, drainagecl == "Well drained") ->.;
.[order(.$nationalmusym), ]
}
pip_b[1:4, 1:4]
## nationalmusym compname comppct_r compkind
## 330 1j665 Miami 40 Taxadjunct
## 328 1jt03 Miami 40 Taxadjunct
## 17 1ns3v Miami 95 Taxadjunct
## 324 1qgv5 Miami 60 Taxadjunct
# tidyverse
pip_t <- filter(s, drainagecl == "Well drained") %>%
arrange(nationalmusym)
pip_t[1:4, 1:4]
## nationalmusym compname comppct_r compkind
## 1 1j665 Miami 40 Taxadjunct
## 2 1jt03 Miami 40 Taxadjunct
## 3 1ns3v Miami 95 Taxadjunct
## 4 1qgv5 Miami 60 Taxadjunct
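The readability gain becomes clearer once three or more steps are chained; a hedged sketch of the same operations written nested versus piped:
# nested - read from the inside out
head(arrange(filter(s, drainagecl == "Well drained"), nationalmusym))

# piped - read from left to right
s %>%
  filter(drainagecl == "Well drained") %>%
  arrange(nationalmusym) %>%
  head()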
In many cases we want to know the variation within groups, which calls for a split-apply-combine approach.
# base
vars <- c("compname", "genhz")
sca_b <- {
split(h, h[vars], drop = TRUE) ->.; # split
lapply(., function(x) data.frame( # apply
x[vars][1, ],
clay_min = round(min(x$clay_r, na.rm = TRUE)),
clay_mean = round(mean(x$clay_r, na.rm = TRUE)),
clay_max = round(max(x$clay_r, na.rm = TRUE))
)) ->.;
do.call("rbind", .) # combine
}
print(sca_b)
## compname genhz clay_min clay_mean clay_max
## Crosby.A Crosby A 18 18 18
## Miami.A Miami A 14 20 30
## Crosby.Ap Crosby Ap 14 18 18
## Miami.Ap Miami Ap 6 18 31
## Crosby.E Crosby E 17 17 20
## Miami.E Miami E 15 19 30
## Crosby.Bt Crosby Bt 24 27 36
## Miami.Bt Miami Bt 24 30 35
## Crosby.2Bt Crosby 2Bt 31 37 40
## Miami.2Bt Miami 2Bt 23 30 31
## Crosby.2BCt Crosby 2BCt 23 25 36
## Miami.2BCt Miami 2BCt 10 23 28
## Crosby.2Cd Crosby 2Cd 15 17 20
## Miami.2Cd Miami 2Cd 3 16 23
## Crosby.not-used Crosby not-used 15 26 36
## Miami.not-used Miami not-used 3 21 31
# tidyverse - dplyr
sca_t <- group_by(h, compname, genhz) %>% # split (sort of)
summarize( # apply and combine
clay_min = round(min(clay_r, na.rm = TRUE)),
clay_mean = round(mean(clay_r, na.rm = TRUE)),
clay_max = round(max(clay_r, na.rm = TRUE))
)
print(sca_t)
## # A tibble: 16 x 5
## # Groups: compname [?]
## compname genhz clay_min clay_mean clay_max
## <chr> <fctr> <dbl> <dbl> <dbl>
## 1 Crosby A 18 18 18
## 2 Crosby Ap 14 18 18
## 3 Crosby E 17 17 20
## 4 Crosby Bt 24 27 36
## 5 Crosby 2Bt 31 37 40
## 6 Crosby 2BCt 23 25 36
## 7 Crosby 2Cd 15 17 20
## 8 Crosby not-used 15 26 36
## 9 Miami A 14 20 30
## 10 Miami Ap 6 18 31
## 11 Miami E 15 19 30
## 12 Miami Bt 24 30 35
## 13 Miami 2Bt 23 30 31
## 14 Miami 2BCt 10 23 28
## 15 Miami 2Cd 3 16 23
## 16 Miami not-used 3 21 31
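Another base option is aggregate(), which groups and summarizes in one call; a hedged sketch (note the three statistics come back as a matrix column named clay_r):
# base option 2 - aggregate()
sca_b2 <- aggregate(clay_r ~ compname + genhz, data = h,
                    FUN = function(x) round(c(min = min(x), mean = mean(x), max = max(x))))
head(sca_b2)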
In many instances, particularly for graphing, it’s necessary to convert a data.frame from wide to long format.
# base wide to long
vars <- c("clay_r", "sand_r", "om_r")
idvars <- c("compname", "genhz")
head(h[c(idvars, vars)])
## compname genhz clay_r sand_r om_r
## 1 Crosby Ap 14 37 2.00
## 2 Crosby Bt 36 21 0.75
## 3 Crosby not-used 20 33 0.75
## 4 Crosby Ap 18 18 2.50
## 5 Crosby Bt 36 21 0.75
## 6 Crosby not-used 20 33 0.75
lo_b <- reshape(h[c("compname", "genhz", vars)], # need to exclude unused columns
direction = "long",
timevar = "variable", times = vars, # capture names of variables in variable column
v.names = "value", varying = vars # capture values of variables in value column
)
head(lo_b) # notice the row.names
## compname genhz variable value id
## 1.clay_r Crosby Ap clay_r 14 1
## 2.clay_r Crosby Bt clay_r 36 2
## 3.clay_r Crosby not-used clay_r 20 3
## 4.clay_r Crosby Ap clay_r 18 4
## 5.clay_r Crosby Bt clay_r 36 5
## 6.clay_r Crosby not-used clay_r 20 6
# tidyverse wide to long
idx <- which(names(h) %in% vars)
lo_t <- select(h, compname, idx) %>% # need to exclude unused columns
gather(key = variable,
value = value,
- compname
)
head(lo_t)
## compname variable value
## 1 Crosby sand_r 37
## 2 Crosby sand_r 21
## 3 Crosby sand_r 33
## 4 Crosby sand_r 18
## 5 Crosby sand_r 21
## 6 Crosby sand_r 33
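In newer versions of tidyr (>= 1.0) gather() has been superseded by pivot_longer(); a minimal sketch of the same reshape, assuming a recent tidyr is installed:
# tidyverse - tidyr >= 1.0
lo_t2 <- select(h, compname, all_of(vars)) %>%
  pivot_longer(cols = all_of(vars), names_to = "variable", values_to = "value")
head(lo_t2)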
# sort factors
comp_sort <- aggregate(value ~ compname, data = lo_b[lo_b$variable == "clay_r", ], median, na.rm = TRUE)
comp_sort <- comp_sort[order(comp_sort$value), ]
lo_b <- within(lo_b, {
compname = factor(lo_b$compname, levels = comp_sort$compname)
genhz = factor(genhz, levels = rev(levels(genhz)))
})
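The tidyverse counterpart for reordering factors is the forcats package (loaded with the tidyverse); a hedged sketch of the same two reorderings using fct_reorder() and fct_rev() (here compname is ordered by the overall median value rather than clay_r alone):
# tidyverse - forcats
lo_f <- mutate(lo_b,
  compname = fct_reorder(compname, value, .fun = median, na.rm = TRUE),
  genhz = fct_rev(genhz)
)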
# lattice density plot
bwplot(genhz ~ value | variable + compname,
data = lo_b,
scales = list(x = "free")
)
# ggplot2 density plot
ggplot(lo_b, aes(x = genhz, y = value)) + # ggplot2 doesn't like factors or strings on the y-axis
geom_boxplot() + # notice ggplot2 chains layers with "+", not "%>%"
facet_wrap(~ compname + variable, scales = "free_x") +
coord_flip()
The tidyverse and its precursors, plyr and reshape2, introduced me to a lot of cool new ways of manipulating data, and made me question how I would do the same things in base R.