What is the practical difference between data.frame and data.table in R [duplicate]
While this is a broad question, if someone is new to R this can be confusing, and the distinction can get lost.
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.
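The inheritance relationship can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; the table `dt` and its columns are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, used only for illustration
dt <- data.table(id = 1:3, value = c(10, 20, 30))

# A data.table carries both classes, with "data.table" listed first
class(dt)
is.data.frame(dt)

# Functions written for data.frames generally accept a data.table as well
nrow(dt)
mean(dt$value)
```

Because "data.frame" is in the class vector, code that only checks `is.data.frame()` will treat a data.table as a regular data.frame.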
data.frame is part of base R.
data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.
However, that syntactic sugar differs from the standard R syntax for data.frame while being hard for the untrained eye to distinguish at a glance. Therefore, if you read a code snippet with no other context to indicate that it works with data.tables, and you try to apply the code to a data.frame, it may fail or produce unexpected results. (A clear giveaway that you are working with data.tables, besides the library/require call, is the presence of the assignment operator :=, which plain data.frame code does not use.)
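To make the pitfall concrete, here is a small sketch of the same-looking code behaving differently on the two classes. It assumes the data.table package is installed; `df`, `dt`, and the column names are hypothetical.

```r
library(data.table)

# Hypothetical toy data, used only for illustration
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
dt <- as.data.table(df)   # a converted copy; df itself is left unchanged

# Single-argument subsetting means different things:
df[2]   # data.frame: selects the second COLUMN (y)
dt[2]   # data.table: selects the second ROW

# Inside dt[...] bare column names are looked up in the table itself
dt[x > 1, y]
df[df$x > 1, "y"]   # the base R version has to repeat the df name

# := adds or modifies a column by reference; no `dt <-` reassignment is needed
dt[, z := x * 2]
names(dt)
```

Running the `:=` line on `df` instead of `dt` would raise an error, which is the kind of failure described above.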
With all that being said, I think it is hard to actually appreciate the beauty of data.table without experiencing the shortcomings of data.frame (for example, see the first 3 bullet points of @eddi's answer). In other words, I would very much suggest learning how to work with and manipulate data.frames first, then moving on to data.tables.
A few differences in my day-to-day life that come to mind (in no particular order):
- not having to specify the data.table name over and over (leading to clumsy syntax and silly mistakes) in expressions (on the flip side I sometimes miss the TAB-completion of names)
- much faster and very intuitive by operations
- no more frantically hitting Ctrl-C after typing df, forgetting how large df was (also leading to almost never using head)
- faster and better file reading with fread
- the package also provides a number of other utility functions, like %between% or rbindlist, that make life better
- faster everything else, since a lot of data.frame operations copy the entire thing needlessly
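
Several of the points in the list above can be sketched in a few lines. This is only an illustration, assuming the data.table package is installed; the `sales` table and its values are invented, and fread is given an in-memory string here rather than a real file.

```r
library(data.table)

# Hypothetical sales data, for illustration only
sales <- data.table(
  region = c("east", "east", "west", "west"),
  amount = c(100, 150, 80, 120)
)

# Grouped aggregation with by: column names are used bare, inside [ ]
totals <- sales[, .(total = sum(amount)), by = region]

# A base R equivalent for comparison
totals_base <- aggregate(amount ~ region, data = as.data.frame(sales), FUN = sum)

# %between% tests whether values fall inside a closed interval
mid <- sales[amount %between% c(90, 130)]

# rbindlist stacks a list of tables into one
combined <- rbindlist(list(sales, data.table(region = "north", amount = 60)))

# fread parses delimited text; normally it is pointed at a file path
parsed <- fread(text = "region,amount\nsouth,45\n")

# Printing a large data.table shows only its first and last rows
big <- data.table(i = 1:1e6)
print(big)
```

The last two lines relate to the Ctrl-C point: printing `big` does not dump a million rows to the console, so reaching for `head` is rarely necessary.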