Concrete JavaScript Regex for Accented Characters (Diacritics)
Solution 1:
The easiest way to accept all accents is this:
[A-zÀ-ú] // lowercase and uppercase letters, accented up to ú (also matches [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, extended to ÿ, so it adds û ü ý þ ÿ (still matches [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above but without [ \ ] ^ _ ` (still matches × ÷)
[A-Za-zÀ-ÖØ-öø-ÿ] // as above but also without × ÷
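To see which stray characters each class lets through, here is a minimal sketch. The variable names are made up for illustration, the anchors ^ and $ are added so that exactly one character is tested, and the file is assumed to be saved as UTF-8.

```javascript
// ASCII characters that sit between "Z" (U+005A) and "a" (U+0061),
// plus the two symbols that sit inside the À-ÿ range.
const suspects = ["[", "\\", "]", "^", "_", "`", "×", "÷"];

const loose = /^[A-zÀ-ÿ]$/;              // second pattern from the list
const strict = /^[A-Za-zÀ-ÖØ-öø-ÿ]$/;    // last pattern from the list

// Collect the non-letters that each class accepts.
const looseExtras = suspects.filter((ch) => loose.test(ch));
const strictExtras = suspects.filter((ch) => strict.test(ch));

console.log(looseExtras.length);  // 8, every suspect gets through
console.log(strictExtras.length); // 0, none of them gets through
console.log(strict.test("é"), strict.test("Ø")); // true true
```

The A-z shortcut is the usual culprit: it spans the whole ASCII block from A to z, not just the letters.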
See https://unicode-table.com/en/ for characters listed in numeric order.
Solution 2:
The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to
[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars
I added these Unicode blocks (\u00C0-\u024F includes three adjacent blocks at once):
- \u00C0-\u00FF Latin-1 Supplement
- \u0100-\u017F Latin Extended-A
- \u0180-\u024F Latin Extended-B
- \u1E00-\u1EFF Latin Extended Additional
Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips the unprintable control characters and all symbols except for the awkwardly placed multiply × \u00D7 and divide ÷ \u00F7.
[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷
If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.
The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
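A quick way to check the ranges against the name that caused the trouble is sketched below. The anchors ^ and $ and the + quantifier are my additions so that a whole name is tested, and the variable names are only illustrative.

```javascript
// Range that stops at the end of Latin Extended-A.
const upToExtendedA = /^[a-zA-Z\u00C0-\u017F]+$/;
// Range that continues through Latin Extended-B, minus × and ÷.
const upToExtendedB = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+$/;

const commaBelow = "\u0218enol"; // "Șenol", S with comma below
const cedilla = "\u015Eenol";    // "Şenol", S with cedilla

console.log(upToExtendedA.test(commaBelow)); // false, U+0218 is in Extended-B
console.log(upToExtendedB.test(commaBelow)); // true
console.log(upToExtendedA.test(cedilla));    // true, U+015E is in Extended-A
console.log(upToExtendedB.test("5×3"));      // false, digits and × are excluded
```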
Solution 3:
Which of these three approaches is most suited for the task?
Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.
I'm forcing a field in a UI to match the format:
last_name, first_name
(last [comma space] first)
The most basic problem I see here is not diacritics but whitespace. Some names consist of multiple words, e.g. when they include titles. So you should go with the most generic pattern, one that allows everything except the comma that separates the last name from the first:
/[^,]+,\s[^,]+/
But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
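As a rough sketch of how that pattern behaves on the "last, first" field: the version below adds ^ and $ anchors, which are my assumption and not part of the pattern above, so that the whole field has to match and a second comma is rejected. The sample names are invented.

```javascript
// Anchored variant: one comma, one whitespace character after it,
// and no further commas anywhere in the field.
const lastFirst = /^[^,]+,\s[^,]+$/;

console.log(lastFirst.test("van der Berg, Șenol")); // true, multi-word and accented
console.log(lastFirst.test("Smith,John"));          // false, no space after the comma
console.log(lastFirst.test("Smith, John, Jr."));    // false, second comma

// The unanchored original only needs to match somewhere in the string,
// so it still accepts the input with two commas.
console.log(/[^,]+,\s[^,]+/.test("Smith, John, Jr.")); // true
```

Because the class is defined by what it excludes, diacritics never need to be listed at all.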
Solution 4:
The XRegExp library has a plugin named Unicode that helps solve tasks like this.
<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
</script>
It's mentioned in the comments to the question, but it's easy to miss. I noticed it only after I submitted this answer.
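For what it's worth, engines that support ES2018 Unicode property escapes can do the same check natively, without the library, as long as the regex carries the u flag. The sketch below mirrors the XRegExp example under that assumption.

```javascript
// \p{L} matches any code point in the Unicode "Letter" category.
// The u flag is required; without it \p is not a property escape.
const unicodeWord = /^\p{L}+$/u;

console.log(unicodeWord.test("Русский")); // true
console.log(unicodeWord.test("日本語"));   // true
console.log(unicodeWord.test("العربية")); // true
console.log(unicodeWord.test("Șenol"));   // true
console.log(unicodeWord.test("abc123"));  // false, digits are not letters
```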
Solution 5:
How about this?
/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/