Hidden powerful features of the tidyverse

Lately, I have had to write R code that runs from the command line (through Rscript) and summarises datasets whose column names are standardized, but where the set of columns actually present may differ from project to project.

In developing this code, I ran into a couple of (new to me) features of the tidyverse, and specifically the packages tidyselect and dplyr. These are:

tidyselect::any_of: variable selection by character vector that ignores missing variables.
the ability of dplyr::across to rename variables on the fly.

If you have been following the development of the tidyverse over the last few years, you’ll know that since the release of dplyr 1.0.0, the recommended way of applying a mutating or summarising function to several columns is to use across. Let’s look at an example.

library(tidyverse)

# Compute mean of all numerical variables
iris |> 
  summarise(across(where(is.numeric), mean))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333

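This approach also works with group_by. Moreover, if you want to pass arguments to the function, you should wrap the call in an anonymous function (passing extra arguments through across itself has been deprecated since dplyr 1.1.0):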

iris |> 
  group_by(Species) |> 
  summarise(across(where(is.numeric), 
                   \(x) mean(x, na.rm = TRUE)))

## # A tibble: 3 × 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03
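
As an aside, across also accepts a .names argument, a glue-style template in which {.col} stands for the original column name. The sketch below assumes you want the same prefix on every summarised column; the "avg_" prefix is only an illustrative choice.

```r
library(dplyr)

# Prefix every summarised column with "avg_" via the .names template
avg_by_species <- iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric),
                   \(x) mean(x, na.rm = TRUE),
                   .names = "avg_{.col}"))

names(avg_by_species)
```

This is convenient when the new names follow a pattern, but it cannot express an arbitrary old-name-to-new-name mapping.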

The functions all_of and any_of are meant for the case where you have a character vector containing the names of the variables you want to use with across. The difference between the two is that all_of throws an error if any of the named columns doesn’t exist, whereas any_of silently skips the missing ones.

vars_to_summarize <- c('Sepal.Length', 'New.Variable')

iris |> 
  group_by(Species) |> 
  summarise(across(any_of(vars_to_summarize), 
                   \(x) mean(x, na.rm = TRUE)))

## # A tibble: 3 × 2
##   Species    Sepal.Length
##   <fct>             <dbl>
## 1 setosa             5.01
## 2 versicolor         5.94
## 3 virginica          6.59

But the following throws an error:

iris |> 
  group_by(Species) |> 
  summarise(across(all_of(vars_to_summarize), 
                   \(x) mean(x, na.rm = TRUE)))

## Error in `summarise()`:
## ℹ In argument: `across(all_of(vars_to_summarize), function(x) mean(x,
##   na.rm = TRUE))`.
## Caused by error in `all_of()`:
## ! Can't subset columns that don't exist.
## ✖ Column `New.Variable` doesn't exist.
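
Because any_of skips missing columns silently, a script run through Rscript may want to report what was skipped. Here is a minimal sketch of one way to do that with base R's setdiff; the wording of the message and the missing_vars name are my own choices, not something dplyr provides.

```r
library(dplyr)

vars_to_summarize <- c("Sepal.Length", "New.Variable")

# Requested columns that are absent from the data
missing_vars <- setdiff(vars_to_summarize, names(iris))
if (length(missing_vars) > 0) {
  message("Skipping missing columns: ",
          paste(missing_vars, collapse = ", "))
}

# any_of() then summarises only the columns that do exist
result <- iris |>
  group_by(Species) |>
  summarise(across(any_of(vars_to_summarize),
                   \(x) mean(x, na.rm = TRUE)))
```

The message goes to standard error, so it does not mix with any results the script prints to standard output.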

The above has been documented elsewhere. But a nice hidden feature that I made use of is that across can rename variables on the fly if you pass a named vector to any_of: the names become the new column names, and the values are the existing columns to look up. Here is what it looks like:

vars_to_summarize <- c(
  avg_sepal_length = 'Sepal.Length', 
  avg_new_variable = 'New.Variable'
  )

iris |> 
  group_by(Species) |> 
  summarise(across(any_of(vars_to_summarize), 
                   \(x) mean(x, na.rm = TRUE)))

## # A tibble: 3 × 2
##   Species    avg_sepal_length
##   <fct>                 <dbl>
## 1 setosa                 5.01
## 2 versicolor             5.94
## 3 virginica              6.59
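
The renaming is handled by tidyselect rather than by across itself, so the same named vector should also work with other selecting verbs. A small sketch with rename and select, where the lookup vector and its new names are purely illustrative:

```r
library(dplyr)

lookup <- c(
  sepal_length = "Sepal.Length",
  new_variable = "New.Variable"  # hypothetical column, absent from iris
)

# rename() keeps every column and renames only the matches it finds
renamed <- iris |> rename(any_of(lookup))

# select() keeps only the requested columns, under their new names
selected <- iris |> select(Species, any_of(lookup))

names(renamed)
names(selected)
```

Swapping any_of for all_of here would bring back the error on the missing column, just as it does inside across.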

I hope this little trick can be useful to others!