Difference between Python's collections.Counter and nltk.probability.FreqDist
Solution 1:
nltk.probability.FreqDist is a subclass of collections.Counter.
From the docs:
A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
The inheritance is explicit in the source code, and there is essentially no difference in how a Counter and a FreqDist are initialized; see https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L106
So speed-wise, creating a Counter and creating a FreqDist should take about the same time. Any difference should be insignificant, but it's worth noting that the possible overheads are:
- the compilation of the class when it is defined in an interpreter
- the cost of duck-typing .__init__()
The major difference is the set of methods that FreqDist provides for statistical / probabilistic Natural Language Processing (NLP), e.g. finding hapaxes. The full list of attributes that FreqDist adds on top of Counter is as follows:
>>> from collections import Counter
>>> from nltk import FreqDist
>>> x = FreqDist()
>>> y = Counter()
>>> set(dir(x)).difference(set(dir(y)))
set(['plot', 'hapaxes', '_cumulative_frequencies', 'r_Nr', 'pprint', 'N', 'unicode_repr', 'B', 'tabulate', 'pformat', 'max', 'Nr', 'freq', '__unicode__'])
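To make the idea of "a Counter plus some statistics helpers" concrete, here is a minimal sketch of a Counter subclass with three of the methods listed above. The class name MiniFreqDist is hypothetical, and the method bodies are my own simplified approximation of what these methods do, not NLTK's actual implementation.

```python
from collections import Counter


class MiniFreqDist(Counter):
    """Hypothetical, simplified stand-in for FreqDist (not NLTK's code)."""

    def N(self):
        # Total number of sample outcomes recorded.
        return sum(self.values())

    def freq(self, sample):
        # Relative frequency of one sample; 0 for an empty distribution.
        total = self.N()
        return self[sample] / total if total else 0

    def hapaxes(self):
        # Samples that occur exactly once.
        return [sample for sample, count in self.items() if count == 1]


words = "the cat sat on the mat".split()
fd = MiniFreqDist(words)
print(fd.N())
print(fd.freq("the"))
print(fd.hapaxes())
```

Everything a plain Counter can do (indexing, update(), most_common(), arithmetic) still works on such a subclass, which is why the two types behave the same for simple counting.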
When it comes to using FreqDist.most_common(), it's actually using the parent method from Counter, so the speed of retrieving the sorted most_common list is the same for both types.
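A quick way to see this kind of inheritance at work, without needing NLTK installed, is to check a Counter subclass that does not override most_common(). The Tally class below is a hypothetical stand-in; the same checks could be run against FreqDist, on the assumption that your NLTK version does not define its own most_common().

```python
from collections import Counter


class Tally(Counter):
    """Hypothetical Counter subclass that adds nothing of its own."""


# The method is not defined on the subclass itself...
print("most_common" in vars(Tally))
# ...so attribute lookup falls through to Counter's implementation.
print(Tally.most_common is Counter.most_common)

tokens = ["a", "b", "a", "c", "a", "b"]
# Both types therefore return the same sorted result.
print(Tally(tokens).most_common(2) == Counter(tokens).most_common(2))
```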
Personally, when I just want to retrieve counts, I use collections.Counter. But when I need to do some statistical manipulation, I either use nltk.FreqDist or dump the Counter into a pandas.DataFrame (see "Transform a Counter object into a Pandas DataFrame").