stemCompletion is not working
Solution 1:
I received the same error when using tm v0.6. I suspect this occurs because stemCompletion
is not in the default transformations for this version of the tm package:
> getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>
Now, the tolower function has the same problem, but can be made operational by using the content_transformer function. I tried a similar approach for stemCompletion but was not successful.
Note, even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:
> stemCompletion("compani",dictCorpus)
compani
"companies"
So that I could continue with my work, I manually split each document in the corpus on single spaces, fed the words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:
stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}
where dictCorpus is just a copy of the cleaned corpus, taken before it is stemmed. The extra stripWhitespace is specific to my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.
For a full example, let's set up a dummy corpus using the crude data in the tm package:
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)
> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}
> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter
> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
latest line us oil companies cut contract posted price last two day cited weak oil market reuter
Note: This example is odd, since the misspelled word "copany" is stemmed to "copani", which has no completion in the dictionary and therefore comes back as "NA". Not sure how to correct this...
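One possible way to deal with the "NA" is to fall back to the stem whenever no completion is found. The sketch below is only an illustration: fill_unmatched is a hypothetical helper (not part of tm), and it assumes that unmatched stems come back as NA or as an empty string, which may depend on the type option and the tm version. The sample values are copied from the output above rather than produced by a live stemCompletion call.

```r
# Hypothetical helper (not part of tm): keep the stem when no completion was found.
# `stems` is the character vector passed to stemCompletion();
# `completed` is the vector it returned (possibly named, possibly containing NA).
fill_unmatched <- function(stems, completed) {
  completed <- unname(as.character(completed))
  # Assumption: an unmatched stem shows up as NA or as an empty string.
  unmatched <- is.na(completed) | completed == ""
  completed[unmatched] <- stems[unmatched]
  completed
}

# Values taken from the example above: "copani" had no completion.
stems     <- c("compani", "copani", "reduct")
completed <- c(compani = "companies", copani = NA, reduct = "reduction")

result <- fill_unmatched(stems, completed)
print(result)
```

Inside stemCompletion_mod, this would be applied to the result of stemCompletion before the paste step. It does not repair the misspelling itself; it only keeps "copani" in the text instead of the literal string "NA".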
To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply with the snow package).
Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.
Solution 2:
I had success with the following workflow:
- use content_transformer to apply an anonymous function to each document of the corpus,
- split the document into words on spaces,
- call stemCompletion on the words with the help of the dictionary,
- and concatenate the separate words into a document again with paste.
POC demo code:
tm_map(c, content_transformer(function(x, d)
paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)
PS: using c as a variable name to store the corpus is not a good idea, because it clashes with base::c.
Solution 3:
Thanks, cdxsza. Your method worked for me.
A note to all who are going to use stemCompletion: the function completes an empty string with a word from the dictionary, which is unexpected. See the example below, where the first "monday" was produced for the blank at the beginning of the string.
stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))
[1] "monday" "monday" "tuesday"
It can be easily fixed by removing the empty string "" before calling stemCompletion, as below.
stemCompletion2 <- function(x, dictionary) {
x <- unlist(strsplit(as.character(x), " "))
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
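The empty string in the example comes from strsplit itself: splitting on a single space leaves an empty token for a leading blank (and for every doubled space). As a minimal base R sketch, an alternative to filtering afterwards is to trim the string and split on runs of whitespace; the variable names here are only illustrative.

```r
x <- " mond tues "

# Splitting on a single space keeps an empty token for the leading blank.
naive <- strsplit(x, " ")[[1]]
print(naive)

# Trimming first and splitting on runs of whitespace yields no empty tokens.
tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
print(tokens)
```

Either approach should give stemCompletion only real stems; the x != "" filter above has the advantage of not needing trimws, which is only available in more recent versions of R.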
See a detailed example on page 12 of the slides at http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf
Regards
Yanchang Zhao
RdataMining.com