How to replace special characters with gsub in R?
I have a text that is written with the old versions of some Romanian letters.
Old | New |
---|---|
ş (s with a cedilla) UTF-8: c59f | ș (s with a comma) UTF-8: c899 |
ţ (t with a cedilla) UTF-8: c5a3 | ț (t with a comma) UTF-8: c89b |
When I export the text from R into a text file, this causes problems (these special letters are exported as s and t). I've manually changed some of the letters, and they were exported correctly.
How can I replace the old versions of these letters with the new ones in R?
So far I have tried:
x<-"ş__s"
gsub("ş","ș",x) # this replaces the letter s also (output: s__s)
gsub("\xc5\x9f","\xc8\x99",x) # this does nothing
gsub("c59f","c899",x) # this does nothing
I hope this is explained clearly enough. Thank you in advance for your responses.
Solution 1:
If writing the characters as-is does not work, you can try using Unicode escapes. Here are the Unicode code points of the relevant letters, from Wikipedia.
ş U+015F (351) https://en.wikipedia.org/wiki/%C5%9E
ţ U+0163 (355) https://en.wikipedia.org/wiki/%C5%A2
ș U+0219 (537) https://en.wikipedia.org/wiki/S-comma
ț U+021B (539) https://en.wikipedia.org/wiki/T-comma
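As a quick cross-check, the code points above can be compared with the UTF-8 byte values listed in the question. The snippet below is only a sketch: the vector name letters_ro and its element names are made up for illustration, and enc2utf8 is called just to make sure the bytes shown are the UTF-8 form.

```r
# The four letters, written as \u escapes so the script itself stays ASCII-only.
# (letters_ro and its element names are illustrative, not from the original post.)
letters_ro <- c(s_cedilla = "\u015F", t_cedilla = "\u0163",
                s_comma   = "\u0219", t_comma   = "\u021B")

# utf8ToInt() returns the code point of each letter as a decimal integer.
code_points <- vapply(letters_ro, utf8ToInt, integer(1))
print(code_points)

# charToRaw() on the UTF-8 form returns the bytes as they would appear
# in a UTF-8 encoded file; paste them into one hex string per letter.
utf8_bytes <- vapply(
  letters_ro,
  function(ch) paste(charToRaw(enc2utf8(ch)), collapse = ""),
  character(1)
)
print(utf8_bytes)
```

If the hex strings match the c59f / c5a3 / c899 / c89b values from the question, the \u escapes refer to the intended letters.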
You can do the conversion in R as below. utf8ToInt is convenient for verifying that the letters are converted as intended.
x <- "ş__ţ"
utf8ToInt(x)
# 351 95 95 355
x2 <- gsub("\u015F", "\u0219", x)
utf8ToInt(x2)
# 537 95 95 355
x3 <- gsub("\u0163", "\u021B", x)
utf8ToInt(x3)
# 351 95 95 539
By the way, since this is a letter-to-letter conversion, the chartr function is more efficient than gsub, because you can convert multiple pairs of letters at once, as below.
x4 <- chartr("\u015F\u0163", "\u0219\u021B", x)
utf8ToInt(x4)
# 537 95 95 539
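Building on the chartr call, here is a rough sketch of how the conversion could be wrapped in a function and the result written to a file as UTF-8. Several things here are assumptions rather than part of the original answer: the function name modernize_romanian is hypothetical, the upper-case pairs (U+015E to U+0218, U+0162 to U+021A) are added for completeness, and writing with enc2utf8 plus useBytes = TRUE is one common way to avoid re-encoding on export, not necessarily the fix for the exact export problem in the question.

```r
# Hypothetical helper: map s/t with cedilla to s/t with comma below,
# for both lower-case and upper-case letters.
modernize_romanian <- function(x) {
  old <- "\u015F\u0163\u015E\u0162"  # cedilla forms: s, t, S, T
  new <- "\u0219\u021B\u0218\u021A"  # comma-below forms: s, t, S, T
  chartr(old, new, x)
}

x <- c("\u015F__\u0163", "\u015E__\u0162")
y <- modernize_romanian(x)

# Inspect the code points to confirm the replacement happened.
print(lapply(y, utf8ToInt))

# Write the strings as raw UTF-8 bytes so R does not re-encode them
# to the native encoding on the way out (assumption: the target file
# should be UTF-8).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(y), con, useBytes = TRUE)
close(con)

# Read the file back, declaring the encoding, and compare code points.
back <- readLines(path, encoding = "UTF-8")
print(identical(lapply(back, utf8ToInt), lapply(y, utf8ToInt)))
```

Comparing code points rather than printed strings sidesteps any console display issues with these letters.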