In R, what exactly is the problem with having variables with the same name as base R functions?
It seems to be generally considered poor programming practice to give a variable the same name as a function in base R.
For example, it is tempting to write:
data <- data.frame(...)
df <- data.frame(...)
Now, the function data loads data sets, while the function df computes the density of the F distribution.
Similarly, it is tempting to write:
a <- 1
b <- 2
c <- 3
This is considered bad form because the function c will combine its arguments.
But: in that workhorse of R functions, lm, which fits linear models, data is used as an argument. In other words, data becomes an explicit variable inside the lm function.
So: If the R core team can use identical names for variables and functions, what stops us mere mortals?
The answer is not that R will get confused. Try the following example, where I explicitly assign a variable with the name c. R doesn't get confused at all by the difference between variable and function:
c("A", "B")
[1] "A" "B"
c <- c("Some text", "Second", "Third")
c(1, 3, 5)
[1] 1 3 5
c[3]
[1] "Third"
The question: What exactly is the problem with having a variable with the same name as a base R function?
Solution 1:
There isn't really one. When a name is used as a function call, R will not normally consider non-function objects while looking for that function:
> mean(1:10)
[1] 5.5
> mean <- 1
> mean(1:10)
[1] 5.5
> rm(mean)
> mean(1:10)
[1] 5.5
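A minimal sketch of the same lookup rule at work inside a function, where a local variable named c sits between the call and base R. The function name demo is hypothetical; everything else is base R.

```r
# Hypothetical helper: a local, non-function `c` does not hide base::c
# when the name is used in call position.
demo <- function() {
  c <- "just a string"      # local variable that reuses the name `c`
  list(
    as_variable = c,        # symbol lookup finds the local string
    as_call     = c(1, 2)   # call lookup skips the string and finds base::c
  )
}

res <- demo()
str(res)
```

The skipping only happens when the name is written directly in call position. It does not rescue a value that has already been passed along under another argument name.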
The examples shown by @Joris and @Sacha are where poor coding catches you out. One better way to write foo is:
foo <- function(x, fun) {
  fun <- match.fun(fun)
  fun(x)
}
Which when used gives:
> foo(1:10, mean)
[1] 5.5
> mean <- 1
> foo(1:10, mean)
[1] 5.5
There are situations where this will catch you out, and @Joris's example with na.omit is one, which, IIRC, happens because of the usual non-standard evaluation used in lm().
Several answers have also conflated the T vs TRUE issue with the masking-of-functions issue. As T and TRUE are not functions, that is a little outside the scope of @Andrie's question.
Solution 2:
The problem is not so much the computer as the user. In general, code can become a lot harder to debug. Typos are made very easily, so if you do:
c <- c("Some text", "Second", "Third")
c[3]
c(3)
You get the correct results. But if you slip somewhere in your code and type c(3) instead of c[3], finding the error will not be that easy.
The scoping can also lead to very confusing error reports. Take the following flawed function:
my.foo <- function(x){
  if(x) c <- 1
  c + 1
}
> my.foo(TRUE)
[1] 2
> my.foo(FALSE)
Error in c + 1 : non-numeric argument to binary operator
With more complex functions, this can lead you on a debugging trail leading nowhere. If you replace c in the above function with a name that has no other definition, say d, the error will read "object 'd' not found". That will lead you to your coding error a lot faster.
Next to that, it can lead to rather confusing code. Code like c(c+c(a,b,c)) asks more from the brain than c(d+c(a,b,d)). Again, this is a trivial example, but it can make a difference.
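The difference in error messages for the flawed my.foo function above can be checked with a sketch like this. The names my.foo.c, my.foo.d and d are hypothetical, and d is assumed to have no definition anywhere on the search path.

```r
# Same flaw twice: the local variable is only created when x is TRUE.
my.foo.c <- function(x) { if (x) c <- 1; c + 1 }   # falls back to the function base::c
my.foo.d <- function(x) { if (x) d <- 1; d + 1 }   # nothing called d to fall back to

# Capture the error messages instead of stopping the script.
err_c <- tryCatch(my.foo.c(FALSE), error = function(e) conditionMessage(e))
err_d <- tryCatch(my.foo.d(FALSE), error = function(e) conditionMessage(e))

print(err_c)   # complains about a non-numeric argument
print(err_d)   # names the missing object directly
```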
And obviously, you can get errors too. When you expect a function but get a variable instead, that can give rise to another set of annoying bugs (here c is still the character vector assigned above):
my.foo <- function(x,fun) fun(x)
my.foo(1,sum)
[1] 1
my.foo(1,c)
Error in my.foo(1, c) : could not find function "fun"
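One possible mitigation, sketched under the assumption that failing early with a clear message is acceptable: check the argument before calling it. The name my.foo.checked and the wording of the message are hypothetical.

```r
# Hypothetical variant of my.foo that validates its argument first.
my.foo.checked <- function(x, fun) {
  if (!is.function(fun)) {
    stop("`fun` must be a function, got an object of class ", class(fun)[1])
  }
  fun(x)
}

c <- c("Some text", "Second", "Third")   # mask c, as earlier in this answer
ok  <- my.foo.checked(1, sum)            # a real function still works
msg <- tryCatch(my.foo.checked(1, c), error = function(e) conditionMessage(e))
rm(c)                                    # remove the mask again

print(msg)
```

The message now points at the argument that was passed in, rather than claiming that a function called fun could not be found.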
A more realistic (and real-life) example of how this can cause trouble:
x <- c(1:10,NA)
y <- c(NA,1:10)
lm(x~y,na.action=na.omit)
# ... correct output ...
na.omit <- TRUE
lm(x~y,na.action=na.omit)
Error in model.frame.default(formula = x ~ y, na.action = na.omit,
drop.unused.levels = TRUE) : attempt to apply non-function
Try figuring out what's wrong here if na.omit <- TRUE occurs 50 lines up in your code...
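Two possible ways out, sketched with base R and stats only: refer to the function through its namespace, or remove the masking variable. The variable names fit, fit2 and n_used are illustrative.

```r
x <- c(1:10, NA)
y <- c(NA, 1:10)
na.omit <- TRUE                       # the accidental mask from the example above

# A namespace-qualified reference is unaffected by the variable na.omit.
fit <- lm(x ~ y, na.action = stats::na.omit)
n_used <- nobs(fit)                   # rows left after the NA rows are dropped

# Alternatively, remove the mask so the plain name resolves to the function again.
rm(na.omit)
fit2 <- lm(x ~ y, na.action = na.omit)
```

Neither option helps you find the stray assignment, but both show that the function itself is intact and only the name was hijacked.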
Answer edited after comment of @Andrie to include the example of confusing error reports
Solution 3:
R is very robust to this, but you can think of ways to break it. For example, consider this function:
foo <- function(x,fun) fun(x)
Which simply applies fun to x. Not the prettiest way to do this, but you might encounter it in someone's script. This works for mean():
> foo(1:10,mean)
[1] 5.5
But if I assign a new value to mean it breaks:
mean <- 1
foo(1:10,mean)
Error in foo(1:10, mean) : could not find function "fun"
This will happen very rarely, but it might happen. It is also very confusing for people if the same name means two things:
mean(mean)
Since it is trivial to use any other name you want, why not use a different name than base R functions? Also, for some R variables this becomes even more important. Think of reassigning the '+' function! Another good example is reassignment of T and F, which can break so many scripts.
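A small sketch of both cases, kept deliberately contained. T and F are ordinary variables that can be masked, whereas TRUE and FALSE are reserved words; the '+' redefinition is wrapped in local() so it cannot leak. The variable names are illustrative.

```r
# Masking T: legal, and silently changes the meaning of any code that uses T.
T <- 0
masked   <- isTRUE(T)    # the mask is in place, so T is no longer TRUE
rm(T)                    # remove the mask; base's T is visible again
restored <- isTRUE(T)

# Masking `+` inside local() so the damage stays in that block.
broken_sum <- local({
  `+` <- function(e1, e2) e1 - e2   # deliberately wrong, for illustration only
  2 + 2
})
normal_sum <- 2 + 2

print(c(masked = masked, restored = restored))
print(c(broken = broken_sum, normal = normal_sum))
```

Writing TRUE and FALSE in full sidesteps the first problem entirely, since R refuses to assign to those names.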