SparkR vs sparklyr [closed]

Does anyone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Having tried both, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr syntax). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?

Best


The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark:

https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
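As a minimal sketch of what that looks like (assuming SparkR 2.x and a local session; the kilometres-per-litre conversion is just an illustration):

library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(mtcars)

# The output schema has to be declared up front.
schema <- structType(structField("mpg", "double"), structField("kpl", "double"))

# dapply runs an arbitrary R function on each partition of the Spark DataFrame.
result <- dapply(df, function(part) {
  data.frame(mpg = part$mpg, kpl = part$mpg * 0.425144)
}, schema)

head(collect(result))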

Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate statements:

http://spark.rstudio.com/dplyr.html#sql_translation
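As a rough illustration of that limitation (local connection assumed; some_custom_r_function is a made-up placeholder with no SQL translation):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Works: log() and basic arithmetic translate to Spark SQL.
mtcars_tbl %>% mutate(log_hp = log(hp), power_to_weight = hp / wt)

# Fails: an arbitrary R function cannot be translated to SQL.
# mtcars_tbl %>% mutate(flag = some_custom_r_function(hp))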

That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).
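For instance, a short sketch of the wrapper-function idea: invoke() lets you call methods on the underlying JVM objects directly (local connection assumed):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Call the count() method of the underlying Spark DataFrame via the JVM.
mtcars_tbl %>% spark_dataframe() %>% invoke("count")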

Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.
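A brief sketch of the ml_* interface (function names as in sparklyr at the time; local connection assumed):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Fit a linear regression through Spark ML.
fit <- ml_linear_regression(mtcars_tbl, response = "mpg", features = c("wt", "cyl"))
summary(fit)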


Being a wrapper, sparklyr has some limitations. For example, using copy_to() to create a Spark data frame does not preserve columns formatted as dates. With SparkR, as.DataFrame() preserves dates.
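A rough sketch of what that difference looks like (illustrative only; in practice you would run the sparklyr and SparkR snippets in separate sessions, since the two packages mask each other's functions):

# sparklyr: the Date column may not survive copy_to() as a date type.
library(sparklyr)
sc <- spark_connect(master = "local")
dates_df <- data.frame(d = as.Date(c("2016-01-01", "2016-06-30")))
dates_tbl <- copy_to(sc, dates_df, "dates")

# SparkR: as.DataFrame() keeps the column as a DateType; printSchema() shows it.
library(SparkR)
sparkR.session(master = "local[*]")
sdf <- as.DataFrame(dates_df)
printSchema(sdf)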


I can give you the highlights for sparklyr:

  • Supports dplyr, Spark ML and H2O.
  • Distributed on CRAN.
  • Easy to install.
  • Extensible.

The current 0.4 version does not yet support arbitrary parallel code execution. However, extensions can easily be written in Scala to overcome this limitation; see sparkhello.


For an overview and in-depth details, you may refer to the documentation. Quoting from the documentation, "the sparklyr package provides a complete dplyr backend". This reflects that sparklyr is NOT a replacement for the original Apache Spark but an extension to it.

Continuing further, regarding its installation (I'm a Windows user) on a standalone computer, you would either need to download and install the new RStudio Preview version or execute the following series of commands in the RStudio console:

devtools::install_github("rstudio/sparklyr")

Install the readr and digest packages if you do not already have them:

install.packages("readr")
install.packages("digest")
library(sparklyr)
spark_install(version = "1.6.2")`

Once the packages are installed and you try to connect to a local instance of Spark using the command:

sc <- spark_connect(master = "local")

You may see an error such as

Created default hadoop bin directory under: C:\spark-1.6.2\tmp\hadoop
Error:

To run Spark on Windows you need a copy of Hadoop winutils.exe:

  1. Download Hadoop winutils.exe from
  2. Copy winutils.exe to C:\spark-1.6.2\tmp\hadoop\bin

Alternatively, if you are using RStudio you can install the RStudio Preview Release which includes an embedded copy of Hadoop winutils.exe.

The error resolution is given to you. Head over to the GitHub account, download the winutils.exe file, save it to C:\spark-1.6.2\tmp\hadoop\bin, and try creating the Spark context again. Last year I published a comprehensive post on my blog detailing installation and working with SparkR in a Windows environment.

Having said that, I would recommend not going through this painful path of installing a local instance of Spark in the usual RStudio; instead, try the RStudio Preview version. It will greatly save you the hassle of creating the Spark context. Continuing further, here is a detailed post on R-bloggers on how sparklyr can be used.

I hope this helps.

Cheers.