Retrieving verb stems from a list of verbs

I have a list of strings which are all verbs. I need to get the word frequencies for each verb, but I want to count verbs such as "want", "wants", "wanting" and "wanted" as one verb. Formally, a "verb" is defined as a set of 4 words of the form {X, Xs, Xed, Xing} or of the form {X, Xes, Xed, Xing}, where X is the verb. How would I go about extracting verbs from the list such that I get "X" and a count of how many times its forms appear? I figured I could somehow use regex, but I'm new to regex and totally lost.


Solution 1:

There is a library called NLTK which has a very broad set of functions for text processing. One group of those functions is the stemmers, which do just what you want (using algorithms/code developed by people with a lot of experience in the area). Here is the result using the Porter stemming algorithm:

In [3]: import nltk

In [4]: verbs = ["want", "wants", "wanting", "wanted"]

In [5]: for verb in verbs:
   ...:     print(nltk.stem.porter.PorterStemmer().stem(verb))
   ...:     
want
want
want
want

You could use this in conjunction with a defaultdict to do something like this (note: in Python 2.7+, a collections.Counter would work equally well, if not better):

In [2]: from collections import defaultdict

In [3]: from nltk.stem.porter import PorterStemmer

In [4]: verbs = ["want", "wants", "wanting", "wanted", "running", "runs", "run"]

In [5]: freq = defaultdict(int)

In [6]: for verb in verbs:
   ...:     freq[PorterStemmer().stem(verb)] += 1
   ...:     

In [7]: freq
Out[7]: defaultdict(<type 'int'>, {'run': 3, 'want': 4})

One thing to note: the stemmers aren't perfect. They work by stripping suffixes, so irregular forms are not folded into their base verb. For instance, adding "ran" to the list above yields this as the result:

defaultdict(<type 'int'>, {'ran': 1, 'run': 3, 'want': 4})

Hopefully, though, it will get you close to what you want.
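
If irregular forms such as "ran" matter, one possible workaround is to map them to their base form by hand before stemming. The following is only a sketch: the IRREGULAR table and the count_stems helper are hypothetical names made up for illustration, and the demo passes an identity function in place of a real stemmer so that it runs without NLTK installed. With NLTK available you would pass PorterStemmer().stem instead.

```python
from collections import Counter

# Hypothetical, hand-maintained table of irregular forms (an assumption for
# this sketch, not something NLTK provides); extend it as needed.
IRREGULAR = {"ran": "run", "went": "go", "gone": "go"}


def count_stems(words, stem):
    """Count words by stem, mapping known irregular forms to their base first.

    `stem` is any callable taking a word and returning its stem,
    for example PorterStemmer().stem from NLTK.
    """
    return Counter(stem(IRREGULAR.get(word, word)) for word in words)


# Demo with an identity "stemmer" so the snippet runs without NLTK.
demo = count_stems(["run", "ran", "go", "went"], lambda word: word)
print(demo)
```

A Counter is used here instead of defaultdict(int) simply because it does the counting in one expression; the result can be read the same way, e.g. demo["run"].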

Solution 2:

To get the base word purely by pattern matching, you could use this code:

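import re

# Strip a trailing "ed", "ing" or "s". The $ anchor makes sure the suffix
# sits at the end of the word, and + requires a non-empty base; without
# them a word like "stop" would match on its leading "s" and give "".
suffix_re = re.compile(r"([a-zA-Z]+)(ed|ing|s)$")

for word in verblist:  # verblist is your list of verb strings
    mtch = suffix_re.match(word)
    if mtch:
        base = mtch.group(1)
    else:
        base = word
    # process the base word here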

Keep in mind that this won't handle irregular verbs well, and it relies on your list containing only verbs. It also strips only "s" from verbs that take "es", so "watches" comes out as "watche". Now, to actually keep track of counts, a dict would probably be best. Create it before the loop with counts = {}. Then, to increment the count for each word, do the following at the end of each iteration:

    if base in counts:
        counts[base] += 1
    else:
        counts[base] = 1
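
Putting the pieces together, here is a minimal standard-library sketch that follows the definition in the question more literally: a suffix is only stripped when the remaining base X also appears in the list, which lets it handle both the "Xs" and the "Xes" forms. It uses plain suffix checks rather than a regex, and the names SUFFIXES, base_form and count_verbs are made up for this example.

```python
from collections import Counter

# Suffixes from the question's definition: {X, Xs, Xed, Xing} or {X, Xes, Xed, Xing}.
# "es" is listed before "s" so that "watches" is tried as "watch" + "es" first.
SUFFIXES = ("ing", "ed", "es", "s")


def base_form(word, vocab):
    """Return X if `word` is X plus a known suffix and X is itself in `vocab`.

    Otherwise the word is returned unchanged.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if candidate in vocab:
                return candidate
    return word


def count_verbs(words):
    """Count each word under its base form, using the list itself as vocabulary."""
    vocab = set(words)
    return Counter(base_form(word, vocab) for word in words)


counts = count_verbs(
    ["want", "wants", "wanting", "wanted", "watch", "watches", "stop"]
)
print(counts)  # Counter({'want': 4, 'watch': 2, 'stop': 1})
```

Because the base has to be present in the list, a word like "stop" is left alone, but an inflected form whose base never occurs (for example "liked" without "lik") is also counted on its own, and irregular verbs are still not handled.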

RocketDonkey beat me to an answer while I was typing, and his answer looks like it'll work better, but I'm posting anyway since this doesn't require extra libraries to be installed, if that's worth anything to you.