Why use purrr::map instead of lapply?

Solution 1:

If the only function you're using from purrr is map(), then no, the advantages are not substantial. As Rich Pauloo points out, the main advantage of map() is the helpers which allow you to write compact code for common special cases:

  • ~ . + 1 is equivalent to function(x) x + 1

  • list("x", 1) is equivalent to function(x) x[["x"]][[1]]. These helpers are a bit more general than [[ - see ?pluck for details. For data rectangling, the .default argument is particularly helpful.

But most of the time you're not using a single *apply()/map() function, you're using a bunch of them, and the advantage of purrr is much greater consistency between the functions. For example:

  • The first argument to lapply() is the data; the first argument to mapply() is the function. The first argument to all map functions is always the data.

  • With vapply(), sapply(), and mapply() you can choose to suppress names on the output with USE.NAMES = FALSE; but lapply() doesn't have that argument.

  • There's no consistent way to pass consistent arguments on to the mapper function. Most functions use ... but mapply() uses MoreArgs (which you'd expect to be called MORE.ARGS), and Map(), Filter() and Reduce() expect you to create a new anonymous function. In map functions, constant argument always come after the function name.

  • Almost every purrr function is type stable: you can predict the output type exclusively from the function name. This is not true for sapply() or mapply(). Yes, there is vapply(); but there's no equivalent for mapply().

You may think that all of these minor distinctions are not important (just as some people think that there's no advantage to stringr over base R regular expressions), but in my experience they cause unnecessary friction when programming (the differing argument orders always used to trip me up), and they make functional programming techniques harder to learn because as well as the big ideas, you also have to learn a bunch of incidental details.

Purrr also fills in some handy map variants that are absent from base R:

  • modify() preserves the type of the data using [[<- to modify "in place". In conjunction with the _if variant this allows for (IMO beautiful) code like modify_if(df, is.factor, as.character)

  • map2() allows you to map simultaneously over x and y. This makes it easier to express ideas like map2(models, datasets, predict)

  • imap() allows you to map simultaneously over x and its indices (either names or positions). This is makes it easy to (e.g) load all csv files in a directory, adding a filename column to each.

    dir("\\.csv$") %>%
      set_names() %>%
      map(read.csv) %>%
      imap(~ transform(.x, filename = .y))
    
  • walk() returns its input invisibly; and is useful when you're calling a function for its side-effects (i.e. writing files to disk).

Not to mention the other helpers like safely() and partial().

Personally, I find that when I use purrr, I can write functional code with less friction and greater ease; it decreases the gap between thinking up an idea and implementing it. But your mileage may vary; there's no need to use purrr unless it actually helps you.

Microbenchmarks

Yes, map() is slightly slower than lapply(). But the cost of using map() or lapply() is driven by what you're mapping, not the overhead of performing the loop. The microbenchmark below suggests that the cost of map() compared to lapply() is around 40 ns per element, which seems unlikely to materially impact most R code.

library(purrr)
n <- 1e4
x <- 1:n
f <- function(x) NULL

mb <- microbenchmark::microbenchmark(
  lapply = lapply(x, f),
  map = map(x, f)
)
summary(mb, unit = "ns")$median / n
#> [1] 490.343 546.880

Solution 2:

Comparing purrr and lapply boils down to convenience and speed.


1. purrr::map is syntactically more convenient than lapply

extract second element of the list

map(list, 2)  

which as @F. Privé pointed out, is the same as:

map(list, function(x) x[[2]])

with lapply

lapply(list, 2) # doesn't work

we need to pass an anonymous function...

lapply(list, function(x) x[[2]])  # now it works

...or as @RichScriven pointed out, we pass [[ as an argument into lapply

lapply(list, `[[`, 2)  # a bit more simple syntantically

So if find yourself applying functions to many lists using lapply, and tire of either defining a custom function or writing an anonymous function, convenience is one reason to favor purrr.

2. Type-specific map functions simply many lines of code

  • map_chr()
  • map_lgl()
  • map_int()
  • map_dbl()
  • map_df()

Each of these type-specific map functions returns a vector, rather than the lists returned by map() and lapply(). If you're dealing with nested lists of vectors, you can use these type-specific map functions to pull out the vectors directly, and coerce vectors directly into int, dbl, chr vectors. The base R version would look something like as.numeric(sapply(...)), as.character(sapply(...)), etc.

The map_<type> functions also have the useful quality that if they cannot return an atomic vector of the indicated type, they fail. This is useful when defining strict control flow, where you want a function to fail if it [somehow] generates the wrong object type.

3. Convenience aside, lapply is [slightly] faster than map

Using purrr's convenience functions, as @F. Privé pointed out slows down processing a bit. Let's race each of the 4 cases I presented above.

# devtools::install_github("jennybc/repurrrsive")
library(repurrrsive)
library(purrr)
library(microbenchmark)
library(ggplot2)

mbm <- microbenchmark(
lapply       = lapply(got_chars[1:4], function(x) x[[2]]),
lapply_2     = lapply(got_chars[1:4], `[[`, 2),
map_shortcut = map(got_chars[1:4], 2),
map          = map(got_chars[1:4], function(x) x[[2]]),
times        = 100
)
autoplot(mbm)

enter image description here

And the winner is....

lapply(list, `[[`, 2)

In sum, if raw speed is what you're after: base::lapply (although it's not that much faster)

For simple syntax and expressibility: purrr::map


This excellent purrr tutorial highlights the convenience of not having to explicitly write out anonymous functions when using purrr, and the benefits of type-specific map functions.