Removing non-ASCII characters from data files
I've got a bunch of CSV files that I'm reading into R and including in a package's data/ folder in .RData format. Unfortunately the non-ASCII characters in the data fail R CMD check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile), but I can't seem to locate one to remove/clean them.
Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?
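For reference, this is roughly how those two checker functions are used (the file name below is made up):

x <- c("Ekstr\u00f8m", "ascii only")
## Prints the elements containing non-ASCII characters, with the bytes escaped
tools::showNonASCII(x)
## Same idea, reading its input line by line from a file
tools::showNonASCIIfile("data-raw/people.csv")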
Solution 1:
These days, a slightly better approach is to use the stringi package, which provides stri_trans_general() for general Unicode transliteration. This allows you to preserve the original text as much as possible:
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom" "Joreskog" "bisschen Zurcher"
Solution 2:
To simply remove the non-ASCII characters, you could use base R's iconv() with sub = "". Something like this should work:
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1" # (just to make sure)
x
# [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm" "Jreskog" "bichen Zrcher"
To locate the non-ASCII characters, or to check whether a file contained any at all, you can adapt the same idea:
## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
# [1] TRUE
## Find which lines (e.g. as read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
# [1] 1 2 3
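Folding this into the raw-data workflow from the question could look like the sketch below; the file names and the latin1 source encoding are assumptions to adjust to your own data:

## Read the raw file, strip the non-ASCII characters, write a clean copy
lines <- readLines("data-raw/people.csv", encoding = "latin1")
clean <- iconv(lines, "latin1", "ASCII", sub = "")
writeLines(clean, "data-raw/people-ascii.csv")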
Solution 3:
To remove all strings containing non-ASCII characters (borrowing code from @Hadley), you can use xfun::is_ascii() together with filter() from dplyr:
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")
x
x %>%
tibble(name = .) %>%
filter(xfun::is_ascii(name)== T)
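If you don't want the dplyr dependency, the same filter is a base R one-liner (my own note, not part of the original answer):

x[xfun::is_ascii(x)]
# [1] "alex"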
Solution 4:
I often have trouble with iconv() and I'm a base R fan, so to remove Unicode or non-ASCII characters I instead use gsub(), with lapply() to apply it across an entire data frame (sketched after the examples below).
gsub("[^\u0001-\u007F]+|<U\\+\\w+>","", string)
The benefit of this gsub() call is that it matches a range of notation formats. Below I show the individual matches for the two alternates of the pattern.
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
gsub("[^\u0001-\u007F]+","", x1)
## "Ekstrm" "Jreskog" "bichen Zrcher"
x2 <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
gsub("[^\u0001-\u007F]+","", x2)
## Same as x1
## "Ekstrm" "Jreskog" "bichen Zrcher"
x3 <- c("<U+FDFA>", "1<U+2009>00", "X<U+203E>")
gsub("<U\\+\\w+>","", x3)
## "" "100" "X"
Solution 5:
textclean::replace_non_ascii() did the job for me. It removes not only accented letters but also the euro, trademark, and other symbols:
x <- c("Ekstr\u00f8m \u2605", "J\u00f6reskog \u20ac", "bi\u00dfchen Z\u00fcrcher \u2122")
stringi::stri_trans_general(x, "latin-ascii")
[1] "Ekstrom ★" "Joreskog €" "bisschen Zurcher ™"
textclean::replace_non_ascii(x)
[1] "Ekstrom" "Joreskog" "bisschen Zurcher cent"