How to find the count of a word in a string?

If you want to find the count of an individual word, just use count:

input_string.count("Hello")
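Be aware that str.count matches substrings, not whole words, so it will also count "Hello" inside "HelloWorld". If you need whole-word matches only, one way (a sketch using the standard re module) is a regex with word boundaries:

```python
import re

input_string = "Hello world, HelloWorld says Hello"

# Substring count: also matches the "Hello" inside "HelloWorld"
print(input_string.count("Hello"))                   # 3

# Whole-word count: \b marks word boundaries
print(len(re.findall(r"\bHello\b", input_string)))   # 2
```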

Use collections.Counter and split() to tally up all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
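For example, with a sample string (variable names as in the snippet above):

```python
from collections import Counter

input_string = "Hello I am going to I with hello am"
words = input_string.split()
wordCount = Counter(words)

print(wordCount["am"])      # 2
print(wordCount["Hello"])   # 1 -- split() keeps case, so "Hello" and "hello" differ
```

Lowercase the string first (input_string.lower().split()) if you want case-insensitive counts.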

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())
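The resulting Counter supports dict-style lookups and most_common; for example (sentence here is assumed to hold your input):

```python
from collections import Counter

sentence = "Hello I am going to I with hello am"
counts = Counter(sentence.lower().split())

print(counts["hello"])         # 2
print(counts.most_common(2))   # the two most frequent (word, count) pairs
```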

from collections import Counter
import re

def countWords(text):
    return Counter(re.findall(r"[\w']+", text.lower()))

Using re.findall is more versatile than split, because split leaves punctuation attached to words (e.g. "hello," is not the same token as "hello"), while the pattern [\w']+ strips punctuation but still keeps contractions such as "don't" and "I'll" intact.

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to be making many of these queries, this does the O(N) tallying work only once, rather than O(N·#queries) work.
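A sketch of that pattern: build the Counter once, then each lookup is a cheap dictionary access (countWords here is the regex-based helper described above):

```python
from collections import Counter
import re

def countWords(text):
    # Keep apostrophes, lowercase everything
    return Counter(re.findall(r"[\w']+", text.lower()))

counts = countWords("Hello I am going to I with hello am")  # O(N), done once

# Each subsequent query is an O(1) dict lookup
for word in ("hello", "am", "missing"):
    print(word, counts[word])  # missing keys count as 0
```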


The vector of word occurrence counts is called a bag-of-words.

Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output:

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizer to get ngrams that include any punctuation as separate tokens?