Lately, I have had to write R code that would run from the command line (through Rscript) and whose expected behaviour was to summarise datasets that have standardized column names, but where the existing columns may differ from project to project.
In developing this code, I ran into a couple of (new to me) features of the tidyverse, specifically in the packages tidyselect and dplyr. These are:
- tidyselect::any_of: variable selection by character vector that ignores missing variables.
- the ability of dplyr::across to rename variables on the fly.
If you have been following the developments of the tidyverse over the last few years, you’ll know that since the release of dplyr 1.0.0, the recommended way of applying a mutating or summarising function to several columns is to use across. Let’s look at an example.
library(tidyverse)
# Compute mean of all numerical variables
iris |>
  summarise(across(where(is.numeric), mean))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.843333 3.057333 3.758 1.199333
This approach also works with group_by. Moreover, if you want to pass arguments to the function, you should use anonymous functions:
iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric),
                   \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
The functions all_of and any_of are useful when you have a character vector containing the names of the variables you want to use with across. The difference between the two is that all_of throws an error if you’re trying to summarize a column that doesn’t exist, whereas any_of silently skips it.
vars_to_summarize <- c('Sepal.Length', 'New.Variable')
iris |>
  group_by(Species) |>
  summarise(across(any_of(vars_to_summarize),
                   \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 2
## Species Sepal.Length
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
But the following throws an error:
iris |>
  group_by(Species) |>
  summarise(across(all_of(vars_to_summarize),
                   \(x) mean(x, na.rm = TRUE)))
## Error in `summarise()`:
## ℹ In argument: `across(all_of(vars_to_summarize), function(x) mean(x,
## na.rm = TRUE))`.
## Caused by error in `all_of()`:
## ! Can't subset columns that don't exist.
## ✖ Column `New.Variable` doesn't exist.
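Since any_of drops missing columns without saying anything, a script run through Rscript may want to report what was skipped. Here is a minimal sketch using only base R; the name missing_vars is my own choice for illustration, and sending the note through message (which writes to stderr) is just one possible way to surface it.

```r
vars_to_summarize <- c('Sepal.Length', 'New.Variable')

# Requested columns that are absent from the data (hypothetical helper name)
missing_vars <- setdiff(vars_to_summarize, names(iris))

# Report the skipped columns on stderr, so the note stays out of regular output
if (length(missing_vars) > 0) {
  message("Skipping missing columns: ", paste(missing_vars, collapse = ", "))
}
```

With the vector above, missing_vars should contain only New.Variable, the one requested column that iris does not have.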
The behaviour of all_of and any_of has been documented elsewhere. But a nice hidden feature that I made use of is that across can rename variables on the fly if you pass a named vector to any_of: the names of the vector become the names of the output columns. Here is what it looks like:
vars_to_summarize <- c(
  avg_sepal_length = 'Sepal.Length',
  avg_new_variable = 'New.Variable'
)
iris |>
  group_by(Species) |>
  summarise(across(any_of(vars_to_summarize),
                   \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 2
## Species avg_sepal_length
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
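To tie this back to the original use case (datasets that share a naming standard but not necessarily the same columns), the pattern can be wrapped in a small function. This is only a sketch under my own assumptions: summarise_available, summary_vars, project_a and project_b are hypothetical names, the two "projects" are simply iris with and without one column, and grouping by Species is hard-coded for the example.

```r
library(dplyr)

# Hypothetical lookup: output name = standardized column name
summary_vars <- c(
  avg_sepal_length = 'Sepal.Length',
  avg_petal_width  = 'Petal.Width'
)

# Hypothetical helper: average whichever of the requested columns exist,
# renaming them on the fly through the names of `vars`
summarise_available <- function(data, vars) {
  data |>
    group_by(Species) |>
    summarise(across(any_of(vars),
                     \(x) mean(x, na.rm = TRUE)))
}

# Two stand-in "projects": same naming standard, different columns
project_a <- iris
project_b <- iris |> select(-Petal.Width)

res_a <- summarise_available(project_a, summary_vars)
res_b <- summarise_available(project_b, summary_vars)

names(res_a)
names(res_b)
```

The same call works on both datasets: res_a should carry both renamed averages, while res_b should only have avg_sepal_length, because Petal.Width is absent from project_b.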
I hope this little trick can be useful to others!