What exactly is copy-on-modify semantics in R, and where is the canonical source?
Every once in a while I come across the notion that R has copy-on-modify semantics, for example in Hadley's devtools wiki.
Most R objects have copy-on-modify semantics, so modifying a function argument does not change the original value
I can trace this term back to the R-Help mailing list. For example, Peter Dalgaard wrote in July 2003:
R is a functional language, with lazy evaluation and weak dynamic typing (a variable can change type at will: a <- 1 ; a <- "a" is allowed). Semantically, everything is copy-on-modify although some optimization tricks are used in the implementation to avoid the worst inefficiencies.
Similarly, Peter Dalgaard wrote in Jan 2004:
R has copy-on-modify semantics (in principle and sometimes in practice) so once part of an object changes, you may have to look in new places for anything that contained it, including possibly the object itself.
Even further back, in Feb 2000 Ross Ihaka said:
We put quite a bit of work into making this happen. I would describe the semantics as "copy on modify (if necessary)". Copying is done only when objects are modified. The (if necessary) part means that if we can prove that the modification cannot change any non-local variables then we just go ahead and modify without copying.
It's not in the manual
No matter how hard I've searched, I can't find a reference to "copy-on-modify" in the R manuals, neither in the R Language Definition nor in R Internals.
Question
My question has two parts:
- Where is this formally documented?
- How does copy-on-modify work?
For example, is it proper to talk about "pass-by-reference", since a promise gets passed to the function?
Call-by-value
The R Language Definition says this (in section 4.3.3 Argument Evaluation):
The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame. [Emphasis added]
Whilst this does not describe the mechanism by which copy-on-modify works, it does mention that changing an object passed to a function doesn't affect the original in the calling frame.
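The behaviour in that quote can be checked directly. This is a minimal sketch; set_first is a hypothetical helper introduced only for illustration and is not part of any manual.

```r
# Hypothetical helper: overwrites the first element of its argument.
set_first <- function(v) {
  v[1] <- 99L   # changes the local variable v only
  v
}

x <- c(1L, 2L, 3L)
y <- set_first(x)

print(x)  # the variable in the calling frame
print(y)  # the modified value returned by the function
```

If the call-by-value description holds, x keeps its original contents and only y carries the change.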
Additional information, particularly on the copy-on-modify aspect, is given in the description of SEXPs in the R Internals manual, section 1.1.2 Rest of Header. Specifically, it states [Emphasis added]:
The named field is set and accessed by the SET_NAMED and NAMED macros, and take values 0, 1 and 2. R has a 'call by value' illusion, so an assignment like b <- a appears to make a copy of a and refer to it as b. However, if neither a nor b are subsequently altered there is no need to copy. What really happens is that a new symbol b is bound to the same value as a and the named field on the value object is set (in this case to 2). When an object is about to be altered, the named field is consulted. A value of 2 means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.) A value of 0 means that it is known that no other SEXP shares data with this object, and so it may safely be altered. A value of 1 is used for situations like dim(a) <- c(7, 2) where in principle two copies of a exist for the duration of the computation as (in principle)
a <- `dim<-`(a, c(7, 2))
but for no longer, and so some primitive functions can be optimized to avoid a copy in this case.
Whilst this doesn't describe the situation whereby objects are passed to functions as arguments, we might deduce that the same process operates, especially given the information from the R Language Definition quoted earlier.
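One way to probe that deduction from R itself is tracemem(), which reports when a traced object is duplicated. The sketch below assumes an R build with memory profiling enabled (hence the capabilities("profmem") guard); modify_arg and read_arg are hypothetical helpers, not taken from the manuals.

```r
# Hypothetical helpers for illustration.
modify_arg <- function(v) { v[1] <- 0L; v }  # writes to its argument
read_arg   <- function(v) sum(v)             # only reads its argument

x <- c(1L, 2L, 3L)

# tracemem() is only available when R was built with memory profiling.
traced <- unname(capabilities("profmem"))
if (traced) invisible(tracemem(x))

total <- read_arg(x)   # read-only use: no duplication is needed
y <- modify_arg(x)     # the write inside the function needs a duplicate first

if (traced) untracemem(x)

print(x)  # the caller's object
print(y)  # the modified duplicate
```

On a build with memory profiling, any tracemem[...] message should appear for the modify_arg call rather than for read_arg, which would be consistent with copying happening only at modification.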
Promises in function evaluation
I don't think it is quite correct to say that a promise is passed to the function. The arguments are passed to the function and the actual expressions used are stored as promises (plus a pointer to the calling environment). Only when an argument gets evaluated is the expression stored in the promise retrieved and evaluated within the environment indicated by the pointer, a process known as forcing.
As such, I don't believe it is correct to talk about pass-by-reference in this regard. R has call-by-value semantics but tries to avoid copying unless a value passed to an argument is evaluated and modified.
The NAMED mechanism is an optimisation (as noted by @hadley in the comments) which allows R to track whether a copy needs to be made upon modification. There are some subtleties involved with exactly how the NAMED mechanism operates, as discussed by Peter Dalgaard (in the R Devel thread @mnel cites in their comment to the question).
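The forcing of promises described above can be observed with a small sketch. show_forcing and ignore_arg are hypothetical names chosen for this illustration; force() is the base R function that evaluates its argument.

```r
# Hypothetical helper: reports when each step happens.
show_forcing <- function(x) {
  cat("entered function\n")
  force(x)                 # first use of x evaluates the stored expression
  cat("argument forced\n")
  invisible(x)
}

# The argument expression announces itself when it is evaluated.
show_forcing({ cat("evaluating argument\n"); 1 })

# An argument that is never used is never evaluated, so stop() does not run.
ignore_arg <- function(x) "x was never forced"
print(ignore_arg(stop("this error is never raised")))
```

The argument expression runs only after the function body has started, and an unused argument is not evaluated at all, which is the lazy behaviour the promise mechanism provides.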
I did some experiments and found that R always copies the object on the first modification.
You can see the result on my machine in http://rpubs.com/wush978/5916
Please let me know if I made any mistake, thanks.
To test if an object is copied or not
I dump the memory address with the following C code:
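#define USE_RINTERNALS
#include <R.h>
#include <Rdefines.h>
/* Print the address of the SEXP's data union, the address of its integer
   payload, and the offset between them (in units of int).
   Relies on the SEXPREC layout exposed by USE_RINTERNALS. */
SEXP dump_address(SEXP src) {
  int *data = INTEGER(src);
  /* %p expects void *, and the pointer difference is cast to long for %ld */
  Rprintf("%16p %16p %ld\n", (void *) &(src->u), (void *) data, (long) (data - (int *) &(src->u)));
  return R_NilValue;
}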
It prints two addresses, followed by the offset between them:
- The address of the data block of the SEXP
- The address of the contiguous block of integers
Let's compile and load this C function.
Rcpp:::SHLIB("dump_address.c")
dyn.load("dump_address.so")
Session Info
Here is the sessionInfo of the testing environment.
sessionInfo()
Copy on Write
First I test the copy-on-write property, which means that R copies an object only when it is modified.
a <- 1L
b <- a
invisible(.Call("dump_address", a))
invisible(.Call("dump_address", b))
b <- b + 1L
invisible(.Call("dump_address", b))
The object b is copied from a at the point of modification, so R does implement the copy-on-write property.
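The same check can be sketched without the C helper, because tracemem() returns the address of its argument as a string. This assumes an R build with memory profiling enabled; the variable names addr_a, addr_b_before and addr_b_after are mine, not from the experiment above.

```r
a <- c(1L, 2L, 3L)
b <- a

# tracemem() is only available when R was built with memory profiling.
if (unname(capabilities("profmem"))) {
  addr_a        <- tracemem(a)
  addr_b_before <- tracemem(b)   # b is still bound to the same object as a
  b[1] <- 10L                    # first write to b triggers the duplication
  addr_b_after  <- tracemem(b)
  untracemem(a)
  untracemem(b)

  print(addr_a == addr_b_before) # compares the addresses before the write
  print(addr_a == addr_b_after)  # compares the addresses after the write
}
```

If copy on write holds, the two names share one address until b is written to and refer to different addresses afterwards.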
Modify vector/matrix in place
Then I test if R will copy the object when we modify an element of a vector/matrix.
Vector with length 1
a <- 1L
invisible(.Call("dump_address", a))
a <- 1L
invisible(.Call("dump_address", a))
a[1] <- 1L
invisible(.Call("dump_address", a))
a <- 2L
invisible(.Call("dump_address", a))
The address changes every time, which means that R does not reuse the memory.
Long vector
system.time(a <- rep(1L, 10^7))
invisible(.Call("dump_address", a))
system.time(a[1] <- 1L)
invisible(.Call("dump_address", a))
system.time(a[1] <- 1L)
invisible(.Call("dump_address", a))
system.time(a[1] <- 2L)
invisible(.Call("dump_address", a))
For a long vector, R reuses the memory after the first modification.
Moreover, the above example also shows that modifying in place noticeably affects performance when the object is huge.
Matrix
system.time(a <- matrix(0L, 3162, 3162))
invisible(.Call("dump_address", a))
system.time(a[1,1] <- 0L)
invisible(.Call("dump_address", a))
system.time(a[1,1] <- 1L)
invisible(.Call("dump_address", a))
system.time(a[1] <- 2L)
invisible(.Call("dump_address", a))
system.time(a[1] <- 2L)
invisible(.Call("dump_address", a))
It seems that R copies the object at the first modification only.
I don't know why.
Changing attribute
system.time(a <- vector("integer", 10^2))
invisible(.Call("dump_address", a))
system.time(names(a) <- paste(1:(10^2)))
invisible(.Call("dump_address", a))
system.time(names(a) <- paste(1:(10^2)))
invisible(.Call("dump_address", a))
system.time(names(a) <- paste(1:(10^2) + 1))
invisible(.Call("dump_address", a))
The result is the same: R copies the object only at the first modification.