Should I use a data.frame or a matrix?

Part of the answer is contained already in your question: You use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type.

Consequently, the choice matrix/data.frame is only problematic if you have data of the same type.

The answer depends on what you are going to do with the data in data.frame/matrix. If it is going to be passed to other functions then the expected type of the arguments of these functions determine the choice.

Also:

Matrices are more memory efficient:

m = matrix(1:4, 2, 2)
d = as.data.frame(m)
object.size(m)
# 216 bytes
object.size(d)
# 792 bytes

Matrices are a necessity if you plan to do any linear algebra-type of operations.

Data frames are more convenient if you frequently refer to its columns by name (via the compact $ operator).

Data frames are also IMHO better for reporting (printing) tabular information as you can apply formatting to each column separately.


Something not mentioned by @Michal is that not only is a matrix smaller than the equivalent data frame, using matrices can make your code far more efficient than using data frames, often considerably so. That is one reason why internally, a lot of R functions will coerce to matrices data that are in data frames.

Data frames are often far more convenient; one doesn't always have solely atomic chunks of data lying around.

Note that you can have a character matrix; you don't just have to have numeric data to build a matrix in R.

In converting a data frame to a matrix, note that there is a data.matrix() function, which handles factors appropriately by converting them to numeric values based on the internal levels. Coercing via as.matrix() will result in a character matrix if any of the factor labels is non-numeric. Compare:

> head(as.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a   B  
[1,] "a" "A"
[2,] "b" "B"
[3,] "c" "C"
[4,] "d" "D"
[5,] "e" "E"
[6,] "f" "F"
> head(data.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a B
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6

I nearly always use a data frame for my data analysis tasks as I often have more than just numeric variables. When I code functions for packages, I almost always coerce to matrix and then format the results back out as a data frame. This is because data frames are convenient.


@Michal: Matrices aren't really more memory efficient:

m <- matrix(1:400000, 200000, 2)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 1600776 bytes

... unless you have a large number of columns:

m <- matrix(1:400000, 2, 200000)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 22400568 bytes