Problem with creating dictionary with gensim for LDA

I'm having trouble using gensim to create a Dictionary and the document-term matrix.

When I run:

from gensim import corpora, models
import gensim
clean = ['door', 'cat', 'mom']
dictionary = corpora.Dictionary(clean)

I get:

doc2bow expects an array of unicode tokens on input, not a single string

In the real problem, clean is still a list. It holds all the words in a large corpus after tokenizing, tagging, removing punctuation, etc.

Why am I getting this error?

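corpora.Dictionary expects an iterable of documents, and each document should be a sequence of unicode tokens (words), not a string. You passed a flat list of strings, so each string ('door', 'cat', 'mom') is treated as a whole document, which triggers the error.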
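
To see why a flat list fails, here is a rough sketch in plain Python. The names check_document and build_vocabulary are made up for illustration and are not part of gensim; they only mimic the idea that every item of the input is handled as one document and that a bare string is rejected.

```python
def check_document(document):
    # Hypothetical stand-in for the kind of type check doc2bow performs:
    # a bare string is rejected, a sequence of tokens is accepted.
    if isinstance(document, str):
        raise TypeError("expected a sequence of tokens, not a single string")
    return list(document)


def build_vocabulary(documents):
    # Each item of `documents` is treated as one whole document.
    vocabulary = set()
    for document in documents:
        vocabulary.update(check_document(document))
    return sorted(vocabulary)


clean = ['door', 'cat', 'mom']

try:
    build_vocabulary(clean)      # each string is taken as a document
    flat_list_ok = True
except TypeError:
    flat_list_ok = False

# Wrapping the list makes it one document containing three tokens.
vocabulary = build_vocabulary([clean])
```

The flat list fails because its items are strings, while the wrapped list succeeds because its single item is a list of tokens.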

If you want the strings 'door', 'cat', and 'mom' to be the words in the dictionary, wrap them in an outer list so they form a single document:

from gensim import corpora
corpus = [
    ['door', 'cat', 'mom'],
]
dictionary = corpora.Dictionary(corpus)
