Why does R behave inconsistently when a non-existent row name is retrieved from a data frame?

I wonder why two data frames, a and b, give different results when a non-existent row name is retrieved. For example,

a <- as.data.frame(matrix(1:3, ncol = 1, nrow = 3, dimnames = list(c("A1", "A10", "B"), "V1")))
a
    V1
A1   1
A10  2
B    3

b <- as.data.frame(matrix(4:5, ncol = 1, nrow = 2, dimnames = list(c("A10", "B"), "V1")))
b
    V1
A10  4
B    5

Let's try to get "A10", "A1", "A" from data frame a:

> a["A10", 1]
[1] 2
> a["A1", 1]
[1] 1                    # expected
> a["A", 1]
[1] NA                   # expected
> a["B", 1]
[1] 3                    # expected
> a["C", 1]
[1] NA                   # expected

Let's do the same for data frame b:

> b["A10", 1]
[1] 4
> b["A1", 1]
[1] 4                    # unexpected, should be NA
> b["A", 1]              
[1] 4                    # unexpected, should be NA
> b["B", 1]
[1] 5                    # expected
> b["C", 1]
[1] NA                   # expected

Given that a["A", 1] returns NA, why don't b["A", 1] and b["A1", 1] return NA as well?

PS. R version 3.5.2

Solution 1:

Synthesizing some of the comments here...

?`[` says:

Unlike S (Becker et al p. 358), R never uses partial matching when extracting by [, and partial matching is not by default used by [[ (see argument exact).

But ?`[.data.frame` says:

Both [ and [[ extraction methods partially match row names. By default neither partially match column names, but [[ will if exact = FALSE (and with a warning if exact = NA). If you want to exact matching on row names use match, as in the examples.

The example given there is:

sw <- swiss[1:5, 1:4]
sw["C", ]
##            Fertility Agriculture Examination Education
## Courtelary      80.2          17          15        12

sw[match("C", row.names(sw)), ]
##    Fertility Agriculture Examination Education
## NA        NA          NA          NA        NA

Meanwhile:

as.matrix(sw)["C", ]
## Error in as.matrix(sw)["C", ] : subscript out of bounds

So row names of matrices are matched exactly while row names of data frames are matched partially, and both behaviours are documented.
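Applied to the data frame b from the question, the match() workaround from the documentation might look like the sketch below. The variable names partial and exact are only illustrative, and b is rebuilt with data.frame() for brevity rather than via as.data.frame(matrix(...)).

```r
# Rebuild b from the question: one column V1, row names "A10" and "B".
b <- data.frame(V1 = 4:5, row.names = c("A10", "B"))

# Default data frame indexing: "A1" partially matches the row name "A10".
partial <- b["A1", 1]

# Exact lookup: match() returns NA when there is no row named exactly "A1",
# and indexing a data frame row by NA yields NA.
exact <- b[match("A1", row.names(b)), 1]

print(partial)
print(exact)
```

Going through match() keeps the lookup exact at the cost of a slightly longer expression; a row name that is really present, such as "A10", is still found as usual.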

[.data.frame is implemented in R, not C, so you can inspect the source code by printing the function. The partial matching happens here:

    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
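This also explains why a and b differ. pmatch() looks for exact matches first and then for unique partial matches; an ambiguous partial match gives NA. In a, "A" is a prefix of both "A1" and "A10", so it is ambiguous, while "A1" matches exactly. In b, "A" and "A1" are both prefixes of only one row name, "A10". A minimal sketch that calls pmatch() directly on the row names from the question (the names rows_a, rows_b, res_a and res_b are illustrative):

```r
# Row names of the two data frames from the question.
rows_a <- c("A1", "A10", "B")
rows_b <- c("A10", "B")

# The keys that were looked up in the question.
keys <- c("A10", "A1", "A", "B", "C")

# Same call that [.data.frame makes for character row indices.
res_a <- pmatch(keys, rows_a, duplicates.ok = TRUE)
res_b <- pmatch(keys, rows_b, duplicates.ok = TRUE)

print(res_a)  # 2 1 NA 3 NA: "A" is ambiguous between "A1" and "A10"
print(res_b)  # 1 1 1 2 NA: "A1" and "A" both resolve uniquely to "A10"
```

The NA positions are the lookups that returned NA in the question, and the repeated 1 in res_b corresponds to the unexpected value 4 returned for b["A1", 1] and b["A", 1].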

There happens to be a recent thread on Bugzilla about partial matching of row names of data frames. (No discussion yet...)

It is definitely surprising that [.data.frame doesn't match the behaviour of [ with respect to character indices.
