R tm package invalid input in 'utf8towcs'
Solution 1:
None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (see the [:graph:] character class in http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).
The code is this simple (str_replace_all comes from the stringr package):
library(stringr)
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")
Solution 2:
This is from the tm FAQ (http://tm.r-forge.r-project.org/faq.html):
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.
I hope this helps; for me it does.
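To see what sub = "byte" does outside of tm, here is a minimal base R sketch. The sample string is made up for illustration and follows the pattern used on the iconv help page; it converts from Latin-1 to ASCII rather than going through enc2utf8, so treat it as a demonstration of the sub argument only, not of the FAQ line itself.

```r
# Hypothetical sample: "\xE7" is the Latin-1 byte for c-cedilla, which has no ASCII equivalent
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

# sub = "byte" replaces each non-convertible byte with its hex code in angle brackets
with_hex <- iconv(x, "latin1", "ASCII", sub = "byte")
print(with_hex)  # "fa<e7>ile"
```

The text survives, and the offending byte stays visible so you can find out where it came from.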
Solution 3:
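By now it seems clear that the problem is caused by emojis, which tolower cannot handle.
# Remove emojis (and any other non-ASCII characters).
# sub = '' drops the offending bytes; without it, iconv returns NA for the whole string.
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII', sub = '')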
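A small base R check of how iconv treats characters that have no ASCII equivalent may be useful here. The sample vector is invented for illustration. As far as I know, leaving sub at its default turns every affected element into NA, while sub = "" silently drops the non-ASCII bytes; the exact behaviour can depend on the platform's iconv implementation.

```r
# Hypothetical sample: plain ASCII, a curly apostrophe, and an emoji
x <- c("plain text", "it\u2019s", "party \U0001F389")

# Default sub = NA: elements with non-convertible characters become NA
dropped_to_na <- iconv(x, "UTF-8", "ASCII")

# sub = "": non-convertible bytes are removed and the rest of the string is kept
stripped <- iconv(x, "UTF-8", "ASCII", sub = "")

print(dropped_to_na)
print(stripped)
```

Note that stripping also removes legitimate punctuation such as curly apostrophes, so "it’s" ends up as "its".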
Solution 4:
I have just run afoul of this problem. By chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).
What I was seeing is that using the solution from the FAQ
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
was giving me this warning:
Warning message:
it is not known that wchar_t is Unicode on this platform
This I traced to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not R.
So here is what I did as a workaround:
tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
This forces iconv to use the UTF-8 encoding variant for the Macintosh and works fine without the need to recompile R.
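If the same script has to run on both macOS and other systems, one option is to pick the target encoding at run time. This is only a sketch: utf8_target and to_utf8_safe are hypothetical helper names, and it assumes that "UTF-8-MAC" is available in the iconv build on macOS, as it was in the workaround above.

```r
# Hypothetical helper: "UTF-8-MAC" on macOS (sysname "Darwin"), plain "UTF-8" elsewhere
utf8_target <- function() {
  if (Sys.info()[["sysname"]] == "Darwin") "UTF-8-MAC" else "UTF-8"
}

# Hypothetical helper: convert to the chosen encoding, showing bad bytes as hex codes
to_utf8_safe <- function(x) iconv(x, to = utf8_target(), sub = "byte")

print(to_utf8_safe("plain ascii text"))

# With a tm corpus this would be used as:
# tm_map(yourCorpus, to_utf8_safe)
```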
Solution 5:
I have often run into this issue, and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (for example, mangling the curly apostrophe in it’s).
I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.