Why does sklearn tf-idf vectorizer give the highest scores to stopwords?
Solution 1:
Stopwords are being assigned large values because of how your corpus is built, which distorts the tf-idf calculation.
The shape of your matrix X is (15, 42396), meaning that you have only 15 documents and that those documents contain 42396 distinct words.
The mistake is in this snippet, where you concatenate all the words of a given category into one huge document instead of keeping the documents separate:
for c in brown.categories():
    doc = ' '.join(brown.words(categories=c))
    corpus.append(doc)
You can modify your code to:
for c in brown.categories():
    doc = [" ".join(x) for x in brown.sents(categories=c)]
    corpus.extend(doc)
which will create one entry per sentence, so every sentence is treated as its own document. Your X matrix will then have a shape of (57340, 42396).
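For reference, a minimal, self-contained version of the corrected pipeline might look like this (variable names are my own; it assumes the Brown corpus has been downloaded with nltk.download('brown'), and the exact vocabulary size may differ slightly depending on tokenization):

from nltk.corpus import brown
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = []
for c in brown.categories():
    # one document per sentence instead of one blob per category
    corpus.extend(" ".join(sent) for sent in brown.sents(categories=c))

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)  # one row per sentence, e.g. (57340, ...)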
This is really important: with many documents, stopwords appear in most of them, which drives their IDF, and therefore their tf-idf scores, down to very low values.
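The low values come from sklearn's smoothed IDF (the default, smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, which bottoms out at 1.0 for a term that occurs in every document. A toy illustration, with a corpus invented purely for this example:

from sklearn.feature_extraction.text import TfidfVectorizer

# invented corpus: "the" occurs in all four documents, "quasar" in only one
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew away",
    "the quasar is a luminous object",
]
vec = TfidfVectorizer()  # smooth_idf=True by default
vec.fit(docs)
idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
print(idf["the"])     # 1.0   -> the minimum possible idf
print(idf["quasar"])  # ~1.92 -> ln(5/2) + 1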
You can have a look at the most important words with the following snippet:
import numpy as np

feature_names = vectorizer.get_feature_names_out()
# positions of the largest stored tf-idf values (the slice keeps the top 24)
sorted_nzs = np.argsort(X.data)[:-25:-1]
# map the column indices of those values back to vocabulary terms
feature_names[X.indices[sorted_nzs]]
Output:
array(['customer', 'asked', 'properties', 'itch', 'locked', 'achieving',
'jack', 'guess', 'criticality', 'me', 'sir', 'beckworth', 'visa',
'will', 'casey', 'athletics', 'norms', 'yeah', 'eh', 'oh', 'af',
'currency', 'example', 'movies'], dtype=object)