R tm package invalid input in 'utf8towcs'

Solution 1:

None of the above answers worked for me. The only workaround I found was to replace every non-graphical character, that is, anything not matched by [:graph:] (see http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html), with a space.

The code is this simple (str_replace_all comes from the stringr package):

library(stringr)
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")
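To see what the pattern does, here is a minimal base-R sketch. It swaps in gsub for stringr's str_replace_all (an assumption made only to avoid the extra dependency), and the helper name clean_text and the sample string are made up for illustration.

```r
# Hypothetical helper: replace every character outside [:graph:] with a space.
clean_text <- function(x) gsub("[^[:graph:]]", " ", x)

# Tabs and newlines are not "graphical", so each one becomes a single space.
sample_text <- "good\tmorning\nworld"
cleaned <- clean_text(sample_text)
print(cleaned)
```

Ordinary spaces are also outside [:graph:], so they are replaced by themselves and word boundaries survive.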

Solution 2:

This is from the tm FAQ (http://tm.r-forge.r-project.org/faq.html):

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

It replaces non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps; it worked for me.
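To make the "hex codes" part concrete, here is a small sketch on a plain character vector rather than a corpus. The sample string and the explicit from/to arguments are assumptions for illustration; the FAQ call relies on the session's native encoding instead.

```r
# Sample text containing a right single quotation mark (U+2019).
x <- "it\u2019s"

# Converting to ASCII with sub = "byte" writes each byte that cannot be
# converted as <xx>, where xx is that byte's hex code.
converted <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
print(converted)
```

On a typical Linux or macOS setup the quotation mark should come out as three bracketed byte codes, which is the kind of "garbage" this approach leaves in the text.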

Solution 3:

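I think it is clear by now that the problem is caused by emojis, which tolower cannot handle.

# Remove emojis (and any other non-ASCII characters); sub = '' drops them
# instead of turning the whole string into NA
dataSet <- iconv(dataSet, from = 'UTF-8', to = 'ASCII', sub = '')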
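One detail worth checking when using iconv this way is its sub argument. The sketch below (the sample strings are made up) compares the default behaviour with sub = "", on the assumption that the goal is to drop non-ASCII characters rather than lose the whole string.

```r
tweets <- c("plain text", "so happy \U0001F600!")

# Default: an element containing a non-convertible character becomes NA.
strict <- iconv(tweets, from = "UTF-8", to = "ASCII")

# sub = "": non-convertible bytes are dropped and the rest is kept.
stripped <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

print(is.na(strict))
print(stripped)
```

If NA values show up after the conversion, a missing sub argument is a likely cause.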

Solution 4:

I have just run afoul of this problem. Are you, by chance, using a machine running OS X? I am, and I seem to have traced the problem to the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform 

I traced this to the enc2utf8 function. The bad news is that this is a problem with the underlying OS, not with R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

This forces iconv to use the Mac-specific UTF-8 encoding and works fine without the need to recompile R.
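If the same script has to run on both macOS and other systems, one possible sketch is to pick the target encoding at run time. The helper name to_utf8_bytes is hypothetical, and checking Sys.info() is an assumption about how to detect macOS; 'UTF-8-MAC' itself comes from the workaround above.

```r
# Hypothetical helper: use the macOS-specific encoding name only on macOS.
to_utf8_bytes <- function(x) {
  on_mac <- identical(unname(Sys.info()["sysname"]), "Darwin")
  target <- if (on_mac) "UTF-8-MAC" else "UTF-8"
  # sub = "byte" writes non-convertible bytes as <xx> hex codes.
  iconv(x, to = target, sub = "byte")
}

result <- to_utf8_bytes(c("hello", "plain ascii"))
print(result)
```

The helper could then be passed to tm_map in place of the anonymous function.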

Solution 5:

I have often run into this issue, and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (for example, mangling the apostrophe in it’s).

I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
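A minimal sketch of that swap, assuming the stringi package is installed; the sample strings are made up for illustration.

```r
library(stringi)  # assumes the stringi package is installed

# One string contains a right single quotation mark (U+2019).
tweets <- c("It\u2019s GREAT", "HELLO World")
lowered <- stri_trans_tolower(tweets)
print(lowered)

# With a tm corpus, the call might look like this (assumption: tm 0.6 or
# later, where plain functions are wrapped in content_transformer):
# yourCorpus <- tm_map(yourCorpus, content_transformer(stri_trans_tolower))
```

The quotation mark passes through unchanged, so nothing is stripped or rewritten as byte codes.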