stemCompletion is not working

Solution 1:

I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not in the default transformations for this version of the tm package:

>  getTransformations
function () 
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument", 
    "stripWhitespace")
<environment: namespace:tm>

The tolower function has the same problem, but it can be made to work by wrapping it in content_transformer. I tried a similar approach for stemCompletion but was not successful.

Note, even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:

> stemCompletion("compani",dictCorpus)
    compani 
"companies" 

So that I could continue with my work, I split each document in the corpus on single spaces, fed the words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:

stemCompletion_mod <- function(x, dict = dictCorpus) {
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}

where dictCorpus is just a copy of the cleaned corpus, taken before stemming. The extra stripWhitespace is specific to my corpus, but it is likely harmless for a general corpus. You may want to change the type option from "shortest" as needed.


For a full example, let's set up a dummy corpus using the crude data in the tm package:

> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)

> # Define modified stemCompletion function
> stemCompletion_mod <- function(x, dict = dictCorpus) {
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}

> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter

> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today 
made light fall oil product price weak crude oil market compani spokeswoman said diamond 
latest line us oil compani cut contract post price last two day cite weak oil market reuter

> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today 
made light fall oil product price weak crude oil market companies spokeswoman said diamond 
latest line us oil companies cut contract posted price last two day cited weak oil market reuter

Note: This example is odd. The misspelled word "copany" is stemmed to "copani", which stemCompletion then returns as NA, presumably because no word in the dictionary begins with that stem. I am not sure how to correct this.
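
If keeping the stem is preferable to an NA, one possible workaround (a sketch, not part of tm) is to fall back to the stem itself. It relies on stemCompletion returning a named vector whose names are the stems, as in the "compani" output shown earlier. The completed vector below is made-up data standing in for a real result:

```r
# Hypothetical stand-in for the output of stemCompletion():
# a named character vector (names = stems, values = completions).
completed <- c(barrel = "barrel", copani = NA, compani = "companies")

# Wherever no completion was found, keep the stem instead of NA.
restored <- ifelse(is.na(completed), names(completed), completed)

print(paste(restored, collapse = " "))
```

The same ifelse line could be placed inside stemCompletion_mod just before the paste step, assuming the stems themselves are acceptable in the final text.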

To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply from the snow package).

Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.

Solution 2:

I had success with the following workflow:

  1. use content_transformer to apply an anonymous function on each document of the corpus,
  2. split the document to words by spaces,
  3. call stemCompletion on the words with the help of the dictionary,
  4. and concatenate the separate words into a document again with paste.

POC demo code:

tm_map(c, content_transformer(function(x, d)
  paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)

PS: using c as the variable name for the corpus is not a good idea, because it shares its name with base::c and makes the code confusing to read.

Solution 3:

Thanks, cdxsza. Your method worked for me.

A note to all who are going to use stemCompletion:

The function completes an empty string with a word from the dictionary, which is unexpected. See the example below, where the first "monday" was produced for the blank at the beginning of the string.

stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))


[1]   "monday"  "monday" "tuesday" 
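
One possible explanation, assuming stemCompletion looks up each stem as a prefix of the dictionary words (an assumption that is consistent with the output above but is not confirmed here): an empty string is a prefix of every word, so it matches the first dictionary entry. The base R sketch below uses a hypothetical stand-in, toy_complete, rather than tm itself:

```r
# toy_complete is a hypothetical stand-in for stemCompletion, in base R only.
# Assumption: a stem is completed by the first dictionary word starting with it.
toy_complete <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) NA_character_ else hits[1]
  }, character(1))
}

# Splitting on single spaces leaves an empty string for the leading blank.
stems <- unlist(strsplit(" mond tues ", " "))
print(stems)

# The empty string is a prefix of every word, so it picks up a completion too.
completed <- toy_complete(stems, c("monday", "tuesday"))
print(unname(completed))
```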

It can be easily fixed by removing empty strings ("") before calling stemCompletion, as below.

stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary=dictionary)
  x <- paste(x, sep="", collapse=" ")
  PlainTextDocument(stripWhitespace(x))
}

myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))

See a detailed example on page 12 of the slides at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf

Regards

Yanchang Zhao

RdataMining.com