In R, what exactly is the problem with having variables with the same name as base R functions?

It seems to be generally considered poor programming practice to give variables the same names as functions in base R.

For example, it is tempting to write:

data <- data.frame(...)
df   <- data.frame(...)

Now, the function data loads data sets, while the function df computes the density of the F distribution.

Similarly, it is tempting to write:

a <- 1
b <- 2
c <- 3

This is considered bad form because the function c will combine its arguments.

But: In that workhorse of R functions, lm, to compute linear models, data is used as an argument. In other words, data becomes an explicit variable inside the lm function.

So: If the R core team can use identical names for variables and functions, what stops us mere mortals?

The answer is not that R will get confused. Try the following example, where I explicitly assign a variable with the name c. R doesn't get confused at all with the difference between variable and function:

c("A", "B")
[1] "A" "B"

c <- c("Some text", "Second", "Third")
c(1, 3, 5)
[1] 1 3 5

c[3]
[1] "Third"

The question: What exactly is the problem with having a variable with the same name as a base R function?

Solution 1:

There isn't really one. When R looks up a name in call position, it normally skips any objects that are not functions:

> mean(1:10)
[1] 5.5
> mean <- 1
> mean(1:10)
[1] 5.5
> rm(mean)
> mean(1:10)
[1] 5.5
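To make the distinction explicit, here is a minimal sketch contrasting an ordinary value lookup with a lookup in call position. The variable names value_lookup, call_lookup and explicit are only illustrative; get() with mode = "function" is one way to ask for the same function-only search by hand.

```r
mean <- 1                                   # a numeric object now masks the name `mean`

# As a plain value, the name resolves to the numeric object.
value_lookup <- is.function(mean)           # FALSE

# In call position, R skips the numeric and finds the base function.
call_lookup <- mean(c(2, 4, 6))             # 4

# The same function-only search, requested explicitly.
explicit <- get("mean", mode = "function")

rm(mean)                                    # remove the masking object again
```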

The examples shown by @Joris and @Sacha are where poor coding catches you out. One better way to write foo is:

foo <- function(x, fun) {
    fun <- match.fun(fun)
    fun(x)
}

Which when used gives:

> foo(1:10, mean)
[1] 5.5
> mean <- 1
> foo(1:10, mean)
[1] 5.5
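As a side-by-side sketch of why match.fun helps, the following compares a version that calls its argument directly with the match.fun version. The names foo_naive and foo_safe are hypothetical, chosen only for this illustration; the error is captured with tryCatch so the script keeps running.

```r
# Hypothetical names for illustration: foo_naive calls its argument directly,
# foo_safe resolves it with match.fun() first.
foo_naive <- function(x, fun) fun(x)
foo_safe <- function(x, fun) {
  fun <- match.fun(fun)   # looks the name up as a function, skipping non-functions
  fun(x)
}

mean <- 1                 # mask base mean with a number

# The naive version fails because `fun` is bound to the number 1.
naive_result <- tryCatch(foo_naive(1:10, mean),
                         error = function(e) conditionMessage(e))

# The match.fun version still finds the function, by symbol or by string.
safe_result <- foo_safe(1:10, mean)
string_result <- foo_safe(1:10, "mean")

rm(mean)                  # remove the masking object again
```

Here naive_result holds the error message, while safe_result and string_result are both 5.5.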

There are situations where this will catch you out, and @Joris's example with na.omit is one, which IIRC, is happening because of the standard, non-standard evaluation used in lm().

Several Answers have also conflated the T vs TRUE issue with the masking of functions issue. As T and TRUE are not functions that is a little outside the scope of @Andrie's Question.

Solution 2:

The problem is not so much the computer as the user. In general, code can become a lot harder to debug. Typos are made very easily, so if you do:

c <- c("Some text", "Second", "Third")
c[3]
c(3)

You get the correct results. But if you slip somewhere in your code and type c(3) instead of c[3], finding the error will not be that easy.

The scoping can also lead to very confusing error reports. Take the following flawed function:

my.foo <- function(x){
    if(x) c <- 1
    c + 1
}

> my.foo(TRUE)
[1] 2
> my.foo(FALSE)
Error in c + 1 : non-numeric argument to binary operator

With more complex functions, this can send you down a debugging trail that leads nowhere. If you replace c in the above function with a name that is not also a function, say z, the error will read "object 'z' not found". That will lead you to your coding error a lot faster.
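The two error messages can be compared directly. This is a small sketch that assumes no object called z exists anywhere on the search path; my.foo2 and the msg_ variables are hypothetical names used only here.

```r
# The flawed function from above: `c` is only assigned when x is TRUE.
my.foo <- function(x) {
  if (x) c <- 1
  c + 1            # falls back to the base function c() when x is FALSE
}

# Hypothetical variant using a name that is not a function (assumes no `z` exists).
my.foo2 <- function(x) {
  if (x) z <- 1
  z + 1            # nothing called z to fall back on when x is FALSE
}

# Capture the messages instead of stopping the script.
msg_c <- tryCatch(my.foo(FALSE), error = function(e) conditionMessage(e))
msg_z <- tryCatch(my.foo2(FALSE), error = function(e) conditionMessage(e))
```

In an English locale, msg_c is the misleading "non-numeric argument to binary operator", while msg_z names the missing object.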

Next to that, it can lead to rather confusing code. Code like c(c+c(a,b,c)) asks more from the brain than c(d+c(a,b,d)). Again, this is a trivial example, but it can make a difference.

And obviously, you can get errors too. When you expect a function, you won't get it, which can give rise to another set of annoying bugs. With c still reassigned to the character vector from above:

my.foo <- function(x,fun) fun(x)
my.foo(1,sum)
[1] 1
my.foo(1,c)
Error in my.foo(1, c) : could not find function "fun"

A more realistic (and real-life) example of how this can cause trouble:

x <- c(1:10,NA)
y <- c(NA,1:10)
lm(x~y,na.action=na.omit)
# ... correct output ...
na.omit <- TRUE
lm(x~y,na.action=na.omit)
Error in model.frame.default(formula = x ~ y, na.action = na.omit, 
drop.unused.levels = TRUE) : attempt to apply non-function

Try figuring out what's wrong here if na.omit <- TRUE occurs 50 lines up in your code...
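One possible way to guard against this kind of masking, sketched here as a suggestion rather than as part of the original example, is to name the namespace explicitly with stats::na.omit. The variables masked and fit are illustrative names.

```r
x <- c(1:10, NA)
y <- c(NA, 1:10)
na.omit <- TRUE      # accidental masking, as in the example above

# Check whether a local object with this name exists in the current environment.
masked <- exists("na.omit", inherits = FALSE)

# The namespace-qualified name bypasses the masking object.
fit <- lm(x ~ y, na.action = stats::na.omit)

rm(na.omit)          # remove the masking object again
```

The check with exists() only looks in the current environment, so it is a quick diagnostic, not a full search for every masked name.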

Answer edited after comment of @Andrie to include the example of confusing error reports

Solution 3:

R is very robust to this, but you can think of ways to break it. For example, consider this function:

foo <- function(x,fun) fun(x)

Which simply applies fun to x. Not the prettiest way to do this, but you might encounter it in someone's script. This works for mean():

> foo(1:10,mean)
[1] 5.5

But if I assign a new value to mean it breaks:

mean <- 1
foo(1:10,mean)

Error in foo(1:10, mean) : could not find function "fun"

This will happen very rarely, but it might happen. It is also very confusing for people if the same thing means two things:

mean(mean)

Since it is trivial to use any other name you want, why not use a different name than those of base R functions? Also, for some R objects this becomes even more important. Think of reassigning the '+' function! Another good example is reassigning T and F, which can break so many scripts.
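Both of those cases are easy to reproduce safely inside local(), so the redefinitions vanish afterwards. This is only a sketch; flag_result, plus_result and after are illustrative names.

```r
# T is an ordinary variable, so it can be reassigned (unlike TRUE).
flag_result <- local({
  T <- 0
  if (T) "on" else "off"     # the reassigned T is treated as FALSE
})

# Operators are functions too and can be redefined in the same way.
plus_result <- local({
  `+` <- function(e1, e2) e1 - e2
  2 + 2                      # uses the local `+`, which subtracts
})

# Outside local(), the base definitions are untouched.
after <- 2 + 2
```

Here flag_result is "off", plus_result is 0, and after is 4 as usual.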
