Problem with creating dictionary with gensim for LDA
I have a problem running gensim to create a Dictionary and the Doc Term Matrix.
When I run:
from gensim import corpora, models
import gensim
clean = ['door', 'cat', 'mom']
dictionary = corpora.Dictionary(clean)
I get:
doc2bow expects an array of unicode tokens on input, not a single string
In the real problem, clean is still a list. It holds all the words of a large corpus after applying a tokenizer and a tagger, removing punctuation, etc.
Why am I getting this error?
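corpora.Dictionary expects an iterable of documents, and each document should be a sequence of unicode tokens (words), not a string. In your code each item of clean is a single string such as 'door', so gensim treats that string as a whole document and raises the error.
If you want the strings 'door', 'cat', and 'mom' to be the words in the dictionary, you could do: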
from gensim import corpora
corpus = [
    ['door', 'cat', 'mom'],
]
dictionary = corpora.Dictionary(corpus)
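Since the question also mentions the doc-term matrix, here is a minimal sketch of the whole path, assuming the flat list of words should count as one document. The helper to_documents and the names documents and doc_term_matrix are made up for illustration and are not part of gensim. The import is guarded only so the shape check still runs where gensim is not installed.

```python
def to_documents(tokens_or_docs):
    """Return a list of token lists (hypothetical helper, not a gensim function)."""
    items = list(tokens_or_docs)
    if all(isinstance(item, str) for item in items):
        # A flat list of words: treat the whole list as a single document.
        return [items]
    # Already a collection of documents: copy each one into a plain list.
    return [list(doc) for doc in items]


clean = ['door', 'cat', 'mom']
documents = to_documents(clean)  # a list holding one document of three tokens

try:
    from gensim import corpora
except ImportError:
    corpora = None  # gensim is missing; only the wrapping step above is shown

if corpora is not None:
    # Dictionary takes the list of documents, not the flat list of words.
    dictionary = corpora.Dictionary(documents)
    # One bag-of-words vector per document: a list of (token_id, count) pairs.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in documents]
    print(dictionary.token2id)
    print(doc_term_matrix)
```

If your real corpus has natural document boundaries (files, paragraphs, sentences), keeping one token list per document is usually more useful for LDA than wrapping everything into a single document as done here.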