Why does sklearn tf-idf vectorizer give the highest scores to stopwords?

Solution 1:

Stopwords end up with large values because of a problem with how your corpus is built, which distorts the tf-idf calculation.

The shape of the matrix X is (15, 42396), meaning that you have only 15 documents (one per Brown category) and that these documents contain 42396 distinct words.

The mistake is that you are concatenating all the words of a given category into a single document, instead of keeping the documents separate, in this snippet:

for c in brown.categories():
    doc = ' '.join(brown.words(categories=c))
    corpus.append(doc)

You can modify your code to:

for c in brown.categories():
    doc = [" ".join(x) for x in brown.sents(categories=c)]
    corpus.extend(doc)

which will create one entry per sentence. Your X matrix will then have a shape of (57340, 42396).
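
For completeness, here is a minimal end-to-end sketch of the fix (the names thisvectorizer and X are assumptions, chosen to match the inspection snippet further down; default TfidfVectorizer settings):

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import brown

corpus = []
for c in brown.categories():
    # one corpus entry per sentence, not one per category
    corpus.extend(" ".join(x) for x in brown.sents(categories=c))

thisvectorizer = TfidfVectorizer()
X = thisvectorizer.fit_transform(corpus)
print(X.shape)  # (57340, 42396)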

This is really important, as stopwords appear in most documents, so their idf term, and with it their tf-idf value, becomes very low.
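
To see this effect in isolation, here is a small sketch on a toy corpus (toy and vec are illustrative names): the term that occurs in every document gets the smallest possible idf.

from sklearn.feature_extraction.text import TfidfVectorizer

# 'the' occurs in all three documents, every other word in exactly one
toy = ["the cat sat", "the dog ran", "the bird flew"]
vec = TfidfVectorizer()
vec.fit(toy)

for word, idf in zip(vec.get_feature_names_out(), vec.idf_):
    # 'the' prints the lowest idf (1.0); the rest get ln(2) + 1 ≈ 1.693
    print(word, round(float(idf), 3))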

You can inspect the terms with the highest tf-idf values with the following snippet:

import numpy as np

# map column indices back to vocabulary terms
feature_names = thisvectorizer.get_feature_names_out()
# indices of the 24 largest tf-idf values anywhere in the sparse matrix
sorted_nzs = np.argsort(X.data)[:-25:-1]
feature_names[X.indices[sorted_nzs]]

Output:

 array(['customer', 'asked', 'properties', 'itch', 'locked', 'achieving',
        'jack', 'guess', 'criticality', 'me', 'sir', 'beckworth', 'visa',
        'will', 'casey', 'athletics', 'norms', 'yeah', 'eh', 'oh', 'af',
        'currency', 'example', 'movies'], dtype=object)
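
If you want the strongest terms of a single document rather than of the whole matrix, you can sort one row instead; a minimal sketch reusing X, feature_names and np from above (row 0 picked arbitrarily):

row = X[0].toarray().ravel()     # dense tf-idf vector of document 0
top = np.argsort(row)[::-1][:5]  # indices of its 5 largest values
print(feature_names[top])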