Retrieving verb stems from a list of verbs
I have a list of strings which are all verbs. I need to get the word frequencies for each verb, but I want to count verbs such as "want", "wants", "wanting" and "wanted" as one verb. Formally, a “verb” is defined as a set of 4 words that are of the form {X, Xs, Xed, Xing} or of the form {X, Xes, Xed, Xing} where X is the verb. How would I go about extracting verbs from the list such that I get "X" and a count of how many times the stem appears? I figured I could somehow use regex, but I'm new to regex and I am totally lost.
Solution 1:
There is a library called nltk which has an insane array of functions for text processing. One subset of those functions is the stemmers, which do just what you want (using algorithms/code developed by people with a lot of experience in the area). Here is the result using the Porter stemming algorithm:
In [3]: import nltk
In [4]: verbs = ["want", "wants", "wanting", "wanted"]
In [5]: for verb in verbs:
   ...:     print(nltk.stem.porter.PorterStemmer().stem(verb))
   ...:
want
want
want
want
You could use this in conjunction with a defaultdict to do something like this (note: in Python 2.7+, a Counter would be equally useful, if not better):
In [2]: from collections import defaultdict
In [3]: from nltk.stem.porter import PorterStemmer
In [4]: verbs = ["want", "wants", "wanting", "wanted", "running", "runs", "run"]
In [5]: freq = defaultdict(int)
In [6]: for verb in verbs:
   ...:     freq[PorterStemmer().stem(verb)] += 1
   ...:
In [7]: freq
Out[7]: defaultdict(<type 'int'>, {'run': 3, 'want': 4})
One thing to note: the stemmers aren't perfect. For instance, adding "ran" to the list above yields this result:
defaultdict(<type 'int'>, {'ran': 1, 'run': 3, 'want': 4})
However, hopefully it will get you close to what you want.
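The Counter variant mentioned above could look roughly like this. It is only a sketch: `count_stems` is a hypothetical helper name (not part of nltk), and it takes the stemming function as an argument so that any callable, such as `PorterStemmer().stem`, can be plugged in.

```python
from collections import Counter


def count_stems(words, stem):
    """Count words grouped by stem(word).

    `count_stems` is a hypothetical helper; `stem` is any callable that
    maps a word to its stem, e.g. PorterStemmer().stem from nltk.
    """
    return Counter(stem(word) for word in words)


# Usage with nltk (assumes nltk is installed):
#   from nltk.stem.porter import PorterStemmer
#   freq = count_stems(verbs, PorterStemmer().stem)
#   freq.most_common()  # (stem, count) pairs, most frequent first
```

Passing the stemmer's bound method in once also avoids creating a new PorterStemmer for every word.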
Solution 2:
To get the base word purely by pattern matching, you could use this code:
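import re

# The suffix must sit at the end of the word ($) and leave a non-empty base (+).
suffix = re.compile(r"([a-zA-Z]+)(ed|ing|s)$")
for word in verblist:
    mtch = suffix.match(word)
    if mtch:
        base = mtch.group(1)
    else:
        base = word
    # process the base word here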
Keep in mind that this doesn't handle irregular verbs well, and it relies on your list containing only verbs. It can also strip an ending that is really part of the base (for example, "pass" becomes "pas"). Now, to actually keep track of counts, a dict would probably be best. A dict can be created before the loop with counts = {}. Then, to increment for each word, you can do the following at the end of each iteration:
if base in counts:
    counts[base] += 1
else:
    counts[base] = 1
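Since the question defines a verb as a word X whose forms X, Xs (or Xes), Xed and Xing all come from the same list, one way to avoid over-stripping is to accept a stripped candidate only when it also appears in the list. Here is a minimal sketch under that assumption; `base_form` and `count_bases` are hypothetical names, and only the standard library is used.

```python
from collections import Counter

# Longer suffixes first, so "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def base_form(word, vocabulary):
    """Return word minus a suffix if the remainder is itself in vocabulary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in vocabulary:
                return candidate
    # No suffix produced a known base, so keep the word unchanged.
    return word


def count_bases(words):
    """Count occurrences per base form (hypothetical helper)."""
    vocabulary = set(words)
    return Counter(base_form(word, vocabulary) for word in words)


verbs = ["want", "wants", "wanting", "wanted", "pass", "passes", "sing"]
counts = count_bases(verbs)
print(counts)  # Counter({'want': 4, 'pass': 2, 'sing': 1})
```

The trade-off is that an inflected form whose base never occurs in the list is left as is and counted separately.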
RocketDonkey beat me to an answer while I was typing, and his answer looks like it'll work better, but I'm posting anyway since this doesn't require extra libraries to be installed, if that's worth anything to you.