(grep) Regex to match non-ASCII characters?
On Linux, I have a directory with lots of files. Some of the filenames contain non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it from working with non-ASCII filenames, and I have to find out how many files are affected. I was going to do this with find, then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression tool, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
Solution 1:
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
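For the counting task in the question, the same pattern can also be tried outside grep. Here is a minimal sketch in Python, assuming the filenames are already available as a list of strings; `count_non_ascii_names` is a hypothetical helper name and the sample names are made up:

```python
import re

# Matches one character outside the 7-bit ASCII range (U+0000 to U+007F).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def count_non_ascii_names(names):
    """Return how many names contain at least one non-ASCII character."""
    return sum(1 for name in names if NON_ASCII.search(name))

# Made-up sample: two of the four names contain an accented letter.
names = ["report.txt", "résumé.pdf", "naïve.md", "data_2020.csv"]
print(count_non_ascii_names(names))  # prints 2
```

On a real directory, the list could come from `os.listdir(".")` instead of the hard-coded sample.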
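You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension rather than a standard POSIX class, so it needs a PCRE-compatible engine):
- [[:ascii:]] - matches a single ASCII char
- [^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you, though strictly speaking it matches non-printable characters rather than non-ASCII ones, and the result depends on the engine and locale.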
Solution 2:
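No, [^\x20-\x7E] does not mean "non-ASCII". It excludes only the printable ASCII range, so it also matches ASCII control characters.
The complement of the full ASCII range is:
[^\x00-\x7F]
Otherwise, the pattern will also catch newlines and other special characters that are part of the ASCII table!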
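The difference between the two ranges is easy to see on a string that mixes control characters with a non-ASCII letter. A small sketch in Python (the sample string is made up for illustration, and `ascii()` is used only to keep the printed output escape-only):

```python
import re

sample = "a\tb\n\u00e9"  # tab, newline, and e-acute (U+00E9)

# Outside printable ASCII: also catches the ASCII control characters.
not_printable_ascii = re.findall(r'[^\x20-\x7E]', sample)

# Outside all of ASCII: catches only the accented letter.
not_ascii = re.findall(r'[^\x00-\x7F]', sample)

print(ascii(not_printable_ascii))  # ['\t', '\n', '\xe9']
print(ascii(not_ascii))            # ['\xe9']
```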
Solution 3:
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
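Not every regex engine supports `\p{...}` classes; Python's built-in `re` module, for instance, does not. As a rough equivalent, the sketch below checks the Unicode general category `Cc` (control) through the standard `unicodedata` module. The helper name `control_chars` is hypothetical, and note that `Cc` also includes DEL (0x7F) in addition to the two ranges listed above.

```python
import unicodedata

def control_chars(text):
    """Return the characters in text whose Unicode category is Cc (control)."""
    return [ch for ch in text if unicodedata.category(ch) == "Cc"]

sample = "ok\x07\x85\u00e9"  # BEL (0x07), NEL (0x85), e-acute (U+00E9)
print(ascii(control_chars(sample)))  # ['\x07', '\x85']
```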
Solution 4:
You can use this regex:
[^\w \xC0-\xFF]
If your tool asks for regex options, select Multiline.
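To see what this class actually matches, here is a small check in Python. It is only a sketch: `re.ASCII` is passed as an assumption so that `\w` means `[A-Za-z0-9_]`, which may differ from the engine the answer had in mind, and the sample string is made up.

```python
import re

# Anything that is not an ASCII word character, a space, or in U+00C0..U+00FF.
pattern = re.compile(r'[^\w \xC0-\xFF]', re.ASCII)

sample = "h\u00e9llo, w\u00f6rld! \u2713"  # e-acute, o-diaeresis, check mark
print(ascii(pattern.findall(sample)))  # [',', '!', '\u2713']
```

Under these assumptions the class flags ASCII punctuation and leaves accented Latin-1 letters alone, so it is not the same as "any non-ASCII character"; [^\x00-\x7F] is the closer fit for the question.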