Why does R behave inconsistently when a non-existent row name is retrieved from a data frame?
I wonder why two data frames, a and b, give different results when a non-existent row name is retrieved. For example:
a <- as.data.frame(matrix(1:3, ncol = 1, nrow = 3, dimnames = list(c("A1", "A10", "B"), "V1")))
a
V1
A1 1
A10 2
B 3
b <- as.data.frame(matrix(4:5, ncol = 1, nrow = 2, dimnames = list(c("A10", "B"), "V1")))
b
V1
A10 4
B 5
Let's try to get "A10", "A1", and "A" from data frame a:
> a["A10", 1]
[1] 2
> a["A1", 1]
[1] 1 # expected
> a["A", 1]
[1] NA # expected
> a["B", 1]
[1] 3 # expected
> a["C", 1]
[1] NA # expected
Let's do the same for data frame b:
> b["A10", 1]
[1] 4
> b["A1", 1]
[1] 4 # unexpected, should be NA
> b["A", 1]
[1] 4 # unexpected, should be NA
> b["B", 1]
[1] 5 # expected
> b["C", 1]
[1] NA # expected
Given that a["A", 1] returns NA, why don't b["A", 1] and b["A1", 1] return NA as well?
PS. R version 3.5.2
Solution 1:
Synthesizing some of the comments here...
?`[` says:
Unlike S (Becker et al p. 358), R never uses partial matching when extracting by [, and partial matching is not by default used by [[ (see argument exact).
But ?`[.data.frame` says:
Both [ and [[ extraction methods partially match row names. By default neither partially match column names, but [[ will if exact = FALSE (and with a warning if exact = NA). If you want to exact matching on row names use match, as in the examples.
The example given there is:
sw <- swiss[1:5, 1:4]
sw["C", ]
##            Fertility Agriculture Examination Education
## Courtelary      80.2          17          15        12
sw[match("C", row.names(sw)), ]
##    Fertility Agriculture Examination Education
## NA        NA          NA          NA        NA
Meanwhile:
as.matrix(sw)["C", ]
## Error in as.matrix(sw)["C", ] : subscript out of bounds
So row names of matrices are matched exactly while row names of data frames are matched partially, and both behaviours are documented.
[.data.frame is implemented in R, not C, so you can inspect the source code by printing the function. The partial matching happens here:
if (is.character(i)) {
    rows <- attr(xx, "row.names")
    i <- pmatch(i, rows, duplicates.ok = TRUE)
}
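To see how that pmatch() call accounts for the results in the question, it can be run directly on the two sets of row names. This is a minimal sketch: rows_a, rows_b, lookups, idx_a and idx_b are names chosen here for illustration, and the row names are copied from the question.

```r
# Row names of the two data frames from the question
rows_a <- c("A1", "A10", "B")
rows_b <- c("A10", "B")

# The names looked up in the question, in the same order
lookups <- c("A10", "A1", "A", "B", "C")

# pmatch() tries exact matches first, then unique partial (prefix) matches;
# an ambiguous or missing match gives NA
idx_a <- pmatch(lookups, rows_a, duplicates.ok = TRUE)
idx_b <- pmatch(lookups, rows_b, duplicates.ok = TRUE)

idx_a  # 2 1 NA 3 NA
idx_b  # 1 1 1 2 NA
```

In rows_a, "A" is a prefix of both "A1" and "A10", so the partial match is ambiguous and the result is NA. In rows_b, "A10" is the only row name starting with "A" or "A1", so both resolve to row 1, which is why b["A", 1] and b["A1", 1] return 4.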
There happens to be a recent thread on Bugzilla about partial matching of row names of data frames. (No discussion yet...)
It is definitely surprising that [.data.frame doesn't match the behaviour of [ with respect to character indices.
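If exact matching is what you want, the match() approach from the documentation can be applied to the data frame in the question. The following is only a sketch: exact_row is a hypothetical helper name introduced here, not a function from base R.

```r
# Data frame b from the question
b <- as.data.frame(matrix(4:5, ncol = 1, nrow = 2,
                          dimnames = list(c("A10", "B"), "V1")))

# Hypothetical helper: look up a row by its exact name.
# match() returns NA for a name that is not present, and indexing
# with that NA yields NA instead of a partially matched row.
exact_row <- function(df, name, col) {
  df[match(name, row.names(df)), col]
}

exact_row(b, "A10", 1)  # 4
exact_row(b, "A1", 1)   # NA
exact_row(b, "A", 1)    # NA
```

Wrapping the lookup in a function keeps the call sites short, but calling match() inline as in the documentation example works just as well.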