Retrieving verb stems from a list of verbs

I have a list of strings, all of which are verbs. I need the word frequency of each verb, but I want to count forms such as "want", "wants", "wanting" and "wanted" as one verb. Formally, a “verb” is a set of 4 words of the form {X, Xs, Xed, Xing} or {X, Xes, Xed, Xing}, where X is the base form. How would I go about extracting the verbs from the list so that I get each "X" along with a count of how many times its forms appear? I figured I could somehow use regex, but I'm new to regex and totally lost.

Solution 1:

There is a library called nltk that has an insane array of functions for text processing. One subset of those functions is the stemmers, which do just what you want (using algorithms and code developed by people with a lot of experience in the area). Here is the result using the Porter stemming algorithm:

In [3]: import nltk

In [4]: verbs = ["want", "wants", "wanting", "wanted"]

In [5]: for verb in verbs:
   ...:     print(nltk.stem.porter.PorterStemmer().stem(verb))
   ...:     
want
want
want
want

You could use this in conjunction with a defaultdict to do something like this (note: in Python 2.7+, a Counter would work equally well, if not better):

In [2]: from collections import defaultdict

In [3]: from nltk.stem.porter import PorterStemmer

In [4]: verbs = ["want", "wants", "wanting", "wanted", "running", "runs", "run"]

In [5]: freq = defaultdict(int)

In [6]: for verb in verbs:
   ...:     freq[PorterStemmer().stem(verb)] += 1
   ...:     

In [7]: freq
Out[7]: defaultdict(<type 'int'>, {'run': 3, 'want': 4})

One thing to note: stemmers aren't perfect. They strip suffixes by rule, so irregular forms are not mapped back to their base form. For instance, adding "ran" to the list above yields this result:

defaultdict(<type 'int'>, {'ran': 1, 'run': 3, 'want': 4})

Hopefully, though, it will get you close to what you want.
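
As a rough sketch of the Counter variant mentioned above, the counting step can be separated from the stemmer. Here count_stems, TOY_STEMS and toy_stem are hypothetical names, and toy_stem is only a lookup-table stand-in so the snippet runs without nltk installed; in practice you would pass a real stemmer's method instead.

```python
from collections import Counter

def count_stems(words, stem):
    """Return a Counter mapping each stem to its number of occurrences.

    `stem` is any callable that maps a word to its stem.
    """
    return Counter(stem(word) for word in words)

# Stand-in stemmer for illustration only (an assumption, not part of nltk):
# a tiny lookup table that falls back to the word itself.
TOY_STEMS = {"wants": "want", "wanting": "want", "wanted": "want",
             "runs": "run", "running": "run"}

def toy_stem(word):
    return TOY_STEMS.get(word, word)

verbs = ["want", "wants", "wanting", "wanted", "running", "runs", "run"]
freq = count_stems(verbs, toy_stem)
print(freq.most_common())  # [('want', 4), ('run', 3)]
```

With nltk installed, count_stems(verbs, PorterStemmer().stem) should give the same totals for this list, and most_common() returns the stems ordered from most to least frequent.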

Solution 2:

To get the base word purely by pattern matching, you could use this code:

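import re

verblist = ["want", "wants", "wanting", "wanted"]  # sample input

for word in verblist:
    # Anchor at the end with $ so only a trailing suffix is stripped;
    # without it, "rest" would match "re" + "s" and yield "re".
    mtch = re.match(r"([a-zA-Z]+)(ed|ing|s)$", word)
    if mtch:
        base = mtch.group(1)
    else:
        base = word
    # process the base word here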

Keep in mind that this doesn't handle irregular verbs well, leaves the "e" behind on Xes forms (for example, "watches" becomes "watche"), and relies on your list containing only verbs. To actually keep track of counts, a dict is probably best. Create it before the loop with counts = {}, then increment the entry for each word at the end of each iteration:

    if base in counts:
        counts[base] += 1
    else:
        counts[base] = 1
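
For a version that follows the question's definition literally, one option is to skip regex and check which forms are actually present: treat X as a verb only if X, Xed, Xing, and either Xs or Xes all occur in the list. This is a minimal sketch under that assumption; count_regular_verbs is a hypothetical name, and words that are not part of a complete set are simply ignored.

```python
from collections import Counter

def count_regular_verbs(words):
    """Count occurrences per base form X.

    X qualifies only if X, Xed, Xing and at least one of Xs / Xes
    all appear in `words`.
    """
    counts = Counter(words)  # occurrences of each surface form
    result = {}
    for base in counts:  # iterates in first-seen order
        third_person = [f for f in (base + "s", base + "es") if f in counts]
        if third_person and base + "ed" in counts and base + "ing" in counts:
            forms = [base, base + "ed", base + "ing"] + third_person
            result[base] = sum(counts[f] for f in forms)
    return result

verbs = ["want", "wants", "wanting", "wanted", "wants",
         "watch", "watches", "watched", "watching",
         "run", "runs"]
print(count_regular_verbs(verbs))  # {'want': 5, 'watch': 4}
```

Here "run" and "runs" are left out because "runed" and "runing" never occur, which is what the strict definition implies; relax the check if incomplete sets should still be counted.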

RocketDonkey beat me to an answer while I was typing, and his answer looks like it'll work better, but I'm posting anyway since this doesn't require extra libraries to be installed, if that's worth anything to you.
