Should I use a data.frame or a matrix?
Part of the answer is contained already in your question: You use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type.
Consequently, the choice between a matrix and a data frame is only a real question when all of your data are of the same type.
The answer then depends on what you are going to do with the data. If they are going to be passed to other functions, the argument types those functions expect determine the choice.
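To make the type point concrete, here is a small sketch. The column names id and label are made up for illustration, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version:

```r
# A matrix holds a single atomic type, so mixing types forces coercion.
m <- cbind(id = 1:3, label = c("a", "b", "c"))
typeof(m)   # "character": the integers were converted to strings

# A data frame keeps each column's own type.
d <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(d, class)   # id is "integer", label is "character"
```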
Also:
Matrices are more memory efficient:
m = matrix(1:4, 2, 2)
d = as.data.frame(m)
object.size(m)
# 216 bytes
object.size(d)
# 792 bytes
Matrices are a necessity if you plan to do any linear algebra-type of operations.
Data frames are more convenient if you frequently refer to their columns by name (via the compact $ operator).
Data frames are also IMHO better for reporting (printing) tabular information as you can apply formatting to each column separately.
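A rough illustration of the linear algebra and column-access points, using a tiny made-up 2 x 2 example (the column names x and y are hypothetical):

```r
# Illustrative data: a 2 x 2 numeric matrix and the equivalent data frame.
m <- matrix(c(2, 0, 0, 2), nrow = 2, dimnames = list(NULL, c("x", "y")))
d <- as.data.frame(m)

# Linear algebra works directly on the matrix.
m %*% m     # matrix product
solve(m)    # matrix inverse
t(m)        # transpose

# A data frame has to be converted first; d %*% d would raise an error.
as.matrix(d) %*% as.matrix(d)

# Column access by name: $ works on the data frame,
# while the matrix needs [ , "name"] (m$x is an error).
d$x
m[, "x"]
```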
Something not mentioned by @Michal is that a matrix is not only smaller than the equivalent data frame; code that works on matrices can also run far faster than the same code on data frames, often considerably so. That is one reason why a lot of R functions internally coerce data frames to matrices.
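If you want to see the difference on your own machine, a sketch along these lines should do. The helper row_sums is hypothetical and deliberately naive (rowSums() would be the sensible choice in real code); timings vary by machine, so none are quoted here:

```r
set.seed(1)
m <- matrix(rnorm(1e5), ncol = 10)   # 10,000 rows, 10 columns
d <- as.data.frame(m)

# Hypothetical helper: pull out one row at a time and sum it.
# Single-row indexing is where data frame overhead tends to show up.
row_sums <- function(x) {
  out <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) out[i] <- sum(x[i, ])
  out
}

system.time(res_m <- row_sums(m))   # matrix version
system.time(res_d <- row_sums(d))   # data frame version
all.equal(res_m, res_d)             # both give the same sums
```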
Data frames are often far more convenient; one doesn't always have solely atomic chunks of data lying around.
Note that you can have a character matrix; you don't just have to have numeric data to build a matrix in R.
In converting a data frame to a matrix, note that there is a data.matrix() function, which handles factors appropriately by converting them to numeric values based on the internal levels. Coercing via as.matrix() will result in a character matrix if any column is a factor or otherwise non-numeric. Compare:
> head(as.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a   B
[1,] "a" "A"
[2,] "b" "B"
[3,] "c" "C"
[4,] "d" "D"
[5,] "e" "E"
[6,] "f" "F"
> head(data.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a B
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
I nearly always use a data frame for my data analysis tasks as I often have more than just numeric variables. When I code functions for packages, I almost always coerce to matrix and then format the results back out as a data frame. This is because data frames are convenient.
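As a minimal sketch of that pattern (the function standardize_columns and the height/weight columns are invented for illustration, not taken from any package):

```r
# Hypothetical helper: centre and scale the columns of a data frame.
standardize_columns <- function(df) {
  m <- data.matrix(df)                                 # coerce once, up front
  centred <- sweep(m, 2, colMeans(m))                  # subtract column means
  scaled <- sweep(centred, 2, apply(m, 2, sd), "/")    # divide by column sds
  as.data.frame(scaled)                                # hand back a data frame
}

res <- standardize_columns(
  data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))
)
res
```

The caller passes in and gets back a data frame, while all of the arithmetic in between happens on a matrix.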
@Michal: Matrices aren't really more memory efficient:
m <- matrix(1:400000, 200000, 2)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 1600776 bytes
... unless you have a large number of columns:
m <- matrix(1:400000, 2, 200000)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 22400568 bytes