What's the correct regex range for JavaScript regexes to match all non-word characters in any script?

In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it matches only [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP?

https://regex101.com/r/yhNF8U/1/

Generic solution

Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W looks like:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Please note the comment for the suggested Unicode property class combination:

This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.
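
As a rough sketch of how that class could be used in JavaScript, the snippet below compares it with the built-in \W. It assumes an engine with ES2018 Unicode property escapes; the names UNICODE_NON_WORD and stripNonWord are made up for illustration.

```javascript
// Unicode-aware counterpart of \W, following the UTS18 recommendation quoted above.
// Needs the `u` flag and ES2018 Unicode property escapes.
const UNICODE_NON_WORD =
  /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

// Hypothetical helper: removes every character matched by `nonWord` from `text`.
function stripNonWord(text, nonWord = UNICODE_NON_WORD) {
  return text.replace(nonWord, "");
}

const sample = "Привет, мир_2024!";
console.log(stripNonWord(sample, /\W/gu)); // "_2024" (built-in \W is ASCII-only)
console.log(stripNonWord(sample));         // "Приветмир_2024"
```

The built-in \W drops every Cyrillic letter, while the property-based class removes only the comma, the space and the exclamation mark.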

More considerations

The \w construct (and thus its \W counterpart), when used in a Unicode-aware context, matches a similar but somewhat different set of characters in each regex engine.

For example, .NET defines the non-word character class \W as [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}]. Since \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm} together make up \p{L}, the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].

In Android (see documentation), \W is [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.

In PHP PCRE, \W matches [^\p{L}\p{N}_].

Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].
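
To see where these definitions disagree, here is a small sketch that transcribes the patterns quoted above into JavaScript syntax. These are approximations taken from this answer, not verified against each engine; \p{Alpha} and \p{Digit} are rendered as \p{Alphabetic} and \p{Nd} on the assumption that they are equivalent, and the object and function names are illustrative only.

```javascript
// JavaScript transcriptions of the \W definitions quoted above.
// Keys and names are illustrative; the patterns are approximations.
const NON_WORD_LIKE = {
  dotnet: /[^\p{L}\p{Nd}\p{Mn}\p{Pc}]/u,
  android: /[^\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]/u,
  pcre: /[^\p{L}\p{N}_]/u,
  python: /[^\p{L}\p{Mn}\p{Nd}_]/u,
};

// List the flavors whose transcription treats `char` as a non-word character.
function nonWordIn(char) {
  return Object.keys(NON_WORD_LIKE).filter((name) => NON_WORD_LIKE[name].test(char));
}

console.log(nonWordIn("\u00B2")); // superscript two, category No
console.log(nonWordIn("\u200D")); // zero width joiner, Join_Control
console.log(nonWordIn("\u203F")); // undertie, category Pc
```

Under these transcriptions, only the PCRE-style class counts the superscript two as a word character, only the Android-style class keeps the zero width joiner, and the undertie is a word character for the .NET and Android styles but not for the other two.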

You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:

/[^\p{L}\p{N}\p{M}\p{Pc}]/gu

where

- [^ - the start of a negated character class that matches a single char other than:
- \p{L} - any Unicode letter
- \p{N} - any Unicode number (decimal digits, letter numbers and other numeric characters)
- \p{M} - a combining mark, such as a diacritic
- \p{Pc} - a connector punctuation symbol
- ] - the end of the character class.

Note that it is the \p{Pc} class that matches the underscore.
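
A minimal sketch of this class in use, splitting a string into words on runs of non-word characters. The helper name splitWords and the sample string are made up for illustration.

```javascript
// Rough Unicode-aware \W: anything that is not a letter, number, mark,
// or connector punctuation. The + groups consecutive separators.
const NON_WORD_RUN = /[^\p{L}\p{N}\p{M}\p{Pc}]+/u;

// Hypothetical helper: split text into "words" and drop empty pieces.
function splitWords(text) {
  return text.split(NON_WORD_RUN).filter((part) => part.length > 0);
}

// "cafe\u0301" is "café" written with a combining acute accent (category Mn).
const words = splitWords("cafe\u0301 au_lait, \u6771\u4EAC!");
console.log(words); // three items: "café" (decomposed), "au_lait", "東京"
```

Without \p{M} in the class, the combining accent would count as a separator and the decomposed "café" would lose its accent.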

Note that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched by \p{Other_Alphabetic} (\p{OAlpha}).
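
The difference can be probed directly in JavaScript. This sketch checks only \p{L} and \p{Alphabetic}; the helper names are illustrative, and the circled letter Ⓐ (U+24B6) is used here as an assumed example of a symbol that is alphabetic without being a letter.

```javascript
// Compare \p{L} with \p{Alphabetic} on single characters.
const isLetter = (ch) => /^\p{L}$/u.test(ch);
const isAlphabetic = (ch) => /^\p{Alphabetic}$/u.test(ch);

// "a" is a letter; U+216B is Ⅻ (category Nl); U+24B6 is Ⓐ (category So).
for (const ch of ["a", "\u216B", "\u24B6"]) {
  console.log(ch, { letter: isLetter(ch), alphabetic: isAlphabetic(ch) });
}
```

Both Ⅻ and Ⓐ should report alphabetic: true and letter: false, so a \W built on \p{Alphabetic} treats them as word characters while one built on \p{L} alone does not.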

Other variations:

/[^\p{L}0-9_]/gu - a \W that is aware of Unicode letters only (digits stay ASCII)
/[^\p{L}\p{N}_]/gu - (PCRE \W style) a \W that is aware of Unicode letters and numbers only.
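
A short sketch of how the two variations differ, using an Arabic-Indic digit as the test character. The variable names are illustrative.

```javascript
const lettersOnly = /[^\p{L}0-9_]/u;         // Unicode letters, ASCII digits only
const lettersAndNumbers = /[^\p{L}\p{N}_]/u; // PCRE-style: Unicode letters and numbers

const arabicIndicThree = "\u0663"; // ٣, category Nd

console.log(lettersOnly.test(arabicIndicThree));       // true: treated as non-word
console.log(lettersAndNumbers.test(arabicIndicThree)); // false: treated as a word char
```

Neither variation includes \p{M} or \p{Pc}, so combining marks and connector punctuation other than the underscore count as non-word characters in both.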

Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.
