How do Rpy2, pyrserve and PypeR compare?
I would like to access R from within a Python program. I am aware of Rpy2, pyrserve and PypeR.
What are the advantages or disadvantages of these three options?
I know one of the 3 better than the others, but in the order given in the question:
rpy2:
- C-level interface between Python and R (R running as an embedded process)
- R objects exposed to Python without the need to copy the data over
- Conversely, Python's numpy arrays can be exposed to R without making a copy
- Low-level interface (close to the R C-API) and high-level interface (for convenience)
- In-place modification for vectors and arrays possible
- R callback functions can be implemented in Python
- Possible to have anonymous R objects with a Python label
- Python pickling possible
- Full customization of R's behavior with its console (so possible to implement a full R GUI)
- Limited support on MS Windows
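To give a feel for the high-level interface, here is a minimal sketch. It assumes both R and rpy2 are installed; `r_mean` is just a name made up for this illustration, and the import guard is only there so the snippet fails with a clear message when either is missing.

```python
# Assumption: rpy2 and a working R installation are available.
# The import can fail if rpy2 is absent or if R cannot be located,
# so it is guarded to keep the sketch usable as a module either way.
try:
    import rpy2.robjects as robjects
except Exception:
    robjects = None


def r_mean(values):
    """Compute the mean of a list of floats with R's own mean() function."""
    if robjects is None:
        raise RuntimeError("rpy2 (or R itself) is not available")
    r_vector = robjects.FloatVector(values)  # Python list -> R numeric vector
    mean_function = robjects.r["mean"]       # look up R's mean() by name
    result = mean_function(r_vector)         # the result is an R vector of length 1
    return result[0]


if robjects is not None:
    print(r_mean([1.0, 2.0, 3.0]))
```

The value returned by `mean_function` is still an R object held by reference on the Python side; only the final indexing pulls a plain Python float out of it.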
pyrserve:
- native Python code (will/should/may work with CPython, Jython, IronPython)
- uses R's Rserve
- advantages and inconveniences linked to remote computation and to Rserve
pyper:
- native Python code (will/should/may work with CPython, Jython, IronPython)
- use of pipes to have Python communicate with R (with the advantages and inconveniences linked to it)
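The pipe approach can be sketched with nothing but the standard library. This is not PypeR's actual implementation, only a one-shot illustration of the idea: start a child interpreter, write a script to its stdin, and read the text it prints. The helper name `run_through_pipe` is hypothetical, and the demo uses the current Python interpreter as a stand-in child so that it runs without R installed.

```python
import subprocess
import sys


def run_through_pipe(command, script_text):
    """Feed script_text to the child's stdin and return what it wrote to stdout."""
    completed = subprocess.run(
        command,
        input=script_text,    # sent to the child through a pipe
        capture_output=True,  # stdout and stderr come back through pipes
        text=True,            # exchange str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Stand-in child: the current Python interpreter reading its program from stdin.
# With R installed, the command would be something like
# ["R", "--vanilla", "--slave"] and the script text would be R code (assumption).
print(run_through_pipe([sys.executable, "-"], "print(1 + 1)\n"))
```

Everything crosses the pipe as text, so results have to be printed by the child and parsed by the parent. That is the main inconvenience of this design, while the benefit is that the child is an ordinary, separate interpreter process.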
edit: Windows support for rpy2
From the paper in the Journal of Statistical Software on PypeR:
RPy presents a simple and efficient way of accessing R from Python. It is robust and very convenient for frequent interaction operations between Python and R. This package allows Python programs to pass Python objects of basic data types to R functions and return the results in Python objects. Such features make it an attractive solution for the cases in which Python and R interact frequently. However, there are still limitations of this package as listed below.
Performance:
RPy may not behave very well for large-size data sets or for computation-intensive duties. A lot of time and memory are inevitably consumed in producing the Python copy of the R data because in every round of a conversation RPy converts the returned value of an R expression into a Python object of basic types or NumPy array. RPy2, a recently developed branch of RPy, uses Python objects to refer to R objects instead of copying them back into Python objects. This strategy avoids frequent data conversions and improves speed. However, memory consumption remains a problem. [...] When we were implementing WebArray (Xia et al. 2005), an online platform for microarray data analysis, a job consumed roughly one quarter more computational time if running R through RPy instead of through R's command-line user interface. Therefore, we decided to run R in Python through pipes in subsequent developments, e.g., WebArrayDB (Xia et al. 2009), which retained the same performance as achieved when running R independently. We do not know the exact reason for such a difference in performance, but we noticed that RPy directly uses the shared library of R to run R scripts. In contrast, running R through pipes means running the R interpreter directly.
Memory:
R has been denounced for its uneconomical use of memory. The memory used by large-size R objects is rarely released after these objects are deleted. Sometimes the only way to release memory from R is to quit R. RPy module wraps R in a Python object. However, the R library will stay in memory even if the Python object is deleted. In other words, memory used by R cannot be released until the host Python script is terminated.
Portability:
As a module with extensions written in C, the RPy source package has to be compiled with a specific R version on POSIX (Portable Operating System Interface for Unix) systems, and the R must be compiled with the shared library enabled. Also, the binary distributions for Windows are bound to specific combinations of different versions of Python/R, so it is quite frequent that a user has difficulty in finding a distribution that fits the user's software environment.