Javascript RegExp + Word boundaries + unicode characters

There appears to be a problem with Regex and the word boundary \b matching the beginning of a string with a starting character out of the normal 256 byte range.

Instead of using \b, try using (?:^|\\s)

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";

// does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

Breakdown:

(?: parenthesis () form a capture group in Regex. Parenthesis started with a question mark and colon ?: form a non-capturing group. They just group the terms together

^ the caret symbol matches the beginning of a string

| the bar is the "or" operator.

\s matches whitespace (appears as \\s in the string because we have to escape the backslash)

) closes the group

So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.

The \b character class in JavaScript RegEx is really only useful with simple ASCII encoding. \b is a shortcut code for the boundary between \w and \W sets or \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.

This makes the RegEx character classes largely useless for dealing with any real language.

\s should work for what you want to do, provided that search terms are only delimited by whitespace.

this question is old, but I think I found a better solution for boundary in regular expressions with unicode letters. Using XRegExp library you can implement a valid \b boundary expanding this

XRegExp('(?=^|$|[^\\p{L}])')

the result is a 4000+ char long, but it seems to work quite performing.

Some explanation: (?= ) is a zero-length lookahead that looks for a begin or end boundary or a non-letter unicode character. The most important think is the lookahead, because the \b doesn't capture anything: it is simply true or false.

I would recommend you to use XRegExp when you have to work with a specific set of characters from Unicode, the author of this library mapped all kind of regional sets of characters making the work with different languages easier.

\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.

Updating and improving on max_masseti's answer:

With the introduction of the /u modifier for RegExs in ES2018, you can now use \p{L} to represent any unicode letter, and \P{L} (notice the uppercase P) to represent anything but.

EDIT: Previous version was incomplete.

As such:

const text = 'A Fé, o Império, e as terras viciosas';

text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/);

// ['A', ' Fé', ',', ' o', ' Império', ',', ' e', ' as', ' terras', ' viciosas']

We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.

Javascript RegExp + Word boundaries + unicode characters

Related

Recent Posts