What is the difference between `%in%` and `==`?
df <- structure(list(x = 1:10, time = c(0.5, 0.5, 1, 2, 3, 0.5, 0.5,
1, 2, 3)), .Names = c("x", "time"), row.names = c(NA, -10L), class = "data.frame")
df[df$time %in% c(0.5, 3), ]
## x time
## 1 1 0.5
## 2 2 0.5
## 5 5 3.0
## 6 6 0.5
## 7 7 0.5
## 10 10 3.0
df[df$time == c(0.5, 3), ]
## x time
## 1 1 0.5
## 7 7 0.5
## 10 10 3.0
What is the difference between %in%
and ==
here?
The problem is vector recycling.
Your first line does exactly what you'd expect. It checks what elements of df$time
are in c(0.5, 3)
and returns the values which are.
Your second line is trickier. It's actually equivalent to
df[df$time == rep(c(0.5,3), length.out=nrow(df)),]
To see this, let's see what happens if use a vector rep(0.5, 10)
:
rep(0.5, 10) == c(0.5, 3)
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
See how it returns every odd value. Essentially it's matching 0.5 to the vector c(0.5, 3, 0.5, 3, 0.5...)
You can manipulate a vector to produce no matches this way. Take the vector: rep(c(3, 0.5), 5)
:
rep(c(3, 0.5), 5) == c(0.5, 3)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
They're all FALSE
; you are matching every 0.5 with 3 and vice versa.
In
df$time == c(0.5,3)
the c(0.5,3)
first gets broadcast to the shape of df$time
, i.e. c(0.5,3,0.5,3,0.5,3,0.5,3,0.5,3)
. Then the two vectors are compared element-by-element.
On the other hand,
df$time %in% c(0.5,3)
checks whether each element of df$time
belongs to the set {0.5, 3}
.
This is an old thread, but I haven't seen this answer anywhere and it might be relevant for some people.
Another difference between the two is handling of NAs (missing values).
NA == NA
[1] NA
NA %in% c(NA)
[1] TRUE