DocumentTermMatrix error on Corpus argument

Solution 1:

It seems this would have worked just fine in tm 0.5.10 but changes in tm 0.6.0 seems to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have automatically done the conversion). They instead return characters and the DocumentTermMatrix isn't sure how to handle a corpus of characters.

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.

Solution 2:

I have found a way to solve this problem in an article about TM.

An example in which the error follows below:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation);
matrix_terms <- DocumentTermMatrix(corpus)

Warning messages:

In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This error occurs because you need an object of the class Vector Source to do your Term Document Matrix, but the previous transformations transform your corpus of texts in character, therefore, changing a class which is not accepted by the function.

However, if you add the function content_transformer inside the tm_map command you may not need even one more command before using the function TermDocumentMatrix to keep going.

The code below changes the class (see second last line) and avoids the error:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class 
matrix_term <- DocumentTermMatrix(corpus)

Solution 3:

Change this:

corpus_clean <- tm_map(news_corpus, tolower)

For this:

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))