How to remove unicode <U+00A6> from string?
I just want to remove the Unicode <U+00A6>, which is at the beginning of the string.
Then you do not need gsub; you can use sub with the "^\\s*<U\\+\\w+>\\s*" pattern:
q <-"<U+00A6> 1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)
Pattern details:
- ^ - start of string
- \\s* - zero or more whitespaces
- <U\\+ - a literal char sequence <U+
- \\w+ - 1 or more letters, digits or underscores
- > - a literal >
- \\s* - zero or more whitespaces.
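To make the anchoring concrete, here is a small sketch that runs the same pattern over a few inputs. Only the first string comes from the question; the other two are made-up variants, and the names pattern, inputs and cleaned are just for illustration.

```r
# Same pattern as above, applied to a character vector.
pattern <- "^\\s*<U\\+\\w+>\\s*"

inputs <- c(
  "<U+00A6> 1000-66329",    # the string from the question
  "  <U+00A6>1000-66329",   # made-up: leading spaces, no space after the marker
  "1000 <U+00A6> 66329"     # made-up: marker not at the start, so nothing matches
)

cleaned <- sub(pattern, "", inputs)
print(cleaned)
# [1] "1000-66329"          "1000-66329"          "1000 <U+00A6> 66329"
```

Because of the ^ anchor, a marker in the middle of the string is left alone, which is why a single sub is enough here.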
If you also need to replace the - with a space, add a |- alternative and use gsub (since we now expect several replacements and the replacement must be a space, the same as in akrun's answer):
trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))
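A short sketch of what happens at each step, assuming q is the string from the question. The intermediate names step1 and only_first are illustrative, not part of the original answer.

```r
q <- "<U+00A6> 1000-66329"

# gsub replaces every match: the marker at the start and the hyphen.
step1 <- gsub("^\\s*<U\\+\\w+>|-", " ", q)
print(step1)
# [1] "  1000 66329"   (two leading spaces: one replacement plus the original space)

# trimws then drops the leading whitespace.
result <- trimws(step1)
print(result)
# [1] "1000 66329"

# For contrast, sub stops after the first match, so the hyphen survives.
only_first <- sub("^\\s*<U\\+\\w+>|-", " ", q)
print(only_first)
# [1] "  1000-66329"
```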
If it is always the first character, you can try:
substring("\U00A6 1000-66329", 2)
If R prints the string as <U+00A6> 1000-66329 instead of ¦ 1000-66329, then <U+00A6> is interpreted as the literal string "<U+00A6>" instead of the Unicode character. Then you can do:
substring("<U+00A6> 1000-66329",9)
Both ways the result is:
[1] " 1000-66329"
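If it is not obvious which of the two cases you have, a quick check along these lines may help. This is only a sketch: the variable names literal and real are made up, and the two strings are written by hand to represent each case.

```r
literal <- "<U+00A6> 1000-66329"   # the eight ASCII characters "<U+00A6>", then the rest
real    <- "\u00A6 1000-66329"     # one actual broken-bar character, then the rest

# The literal form is seven characters longer.
n_literal <- nchar(literal)
n_real    <- nchar(real)
print(c(n_literal, n_real))
# [1] 19 12

# The literal form starts with the text "<U+"; the real one does not.
is_literal <- startsWith(c(literal, real), "<U+")
print(is_literal)
# [1]  TRUE FALSE

# The first character of the real form has code point 0xA6 (166).
first_code <- utf8ToInt(substr(real, 1, 1))
print(first_code)
# [1] 166
```

With that known, substring(x, 9) fits the literal case and substring(x, 2) fits the real character.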
We can also do
trimws(gsub("\\S+\\s+|-", " ", q))
#[1] "1000 66329"
Instead of removing it, you should convert it to the appropriate format. You have to set your locale to UTF-8, like so:
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
Maybe you will see the following message:
Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In this case you should use stringi::stri_trans_general(x, "zh")
Here "zh" means "Chinese". You should know which language you have to convert to. That's it.
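To find out whether the locale change actually took effect, one option is to look at what Sys.setlocale returns: it gives back an empty string when the request cannot be honored. The sketch below assumes the "en_US.UTF-8" locale name from above, which may not exist on every system; the names old_ctype and res are illustrative.

```r
# Remember the current setting so it can be reported or restored.
old_ctype <- Sys.getlocale("LC_CTYPE")

# Sys.setlocale returns "" (with a warning) if the locale is unavailable.
res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "en_US.UTF-8"))

if (identical(res, "")) {
  message("Locale en_US.UTF-8 is not available; still using: ", old_ctype)
} else {
  message("LC_CTYPE is now: ", res)
}

# l10n_info() reports whether the session is currently in a UTF-8 locale.
in_utf8 <- l10n_info()[["UTF-8"]]
print(in_utf8)
```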