How to choose between whitespace patterns?
In the Oracle Pattern documentation there are descriptions of three different patterns for matching whitespace:
- \s
- \p{Space}
- \p{javaWhitespace}
I'm wondering what the specifics of each are and how to choose the right one.
I've just noticed that \p{javaWhitespace} includes more kinds of space characters.
I would rather use the first.
- It is compact
- It is the same notation in many other languages, as well as in regular expression theory
- \p{javaWhitespace} includes FILE SEPARATOR, GROUP SEPARATOR, etc. (see this). Using it when these are not needed may confuse somebody else.
- In general I would expect another programmer to know what \s is, while I would expect them to double-check the exact definition of \p{javaWhitespace}. You don't want that, as it diminishes code clarity and adds unnecessary burden during debugging.
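The point about the separators can be checked with a minimal sketch like the one below. The class name SeparatorDemo and the sample code points are my own choices for illustration; only \s and \p{javaWhitespace} come from the Pattern documentation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample code points are illustrative only.
class SeparatorDemo {
    static final Pattern S = Pattern.compile("\\s");
    static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");

    // True if the single code point is matched by the default \s class.
    static boolean matchesS(int codePoint) {
        return S.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // True if the single code point is matched by \p{javaWhitespace}.
    static boolean matchesJavaWhitespace(int codePoint) {
        return JAVA_WS.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // SPACE for comparison, then U+001C FILE SEPARATOR through U+001F UNIT SEPARATOR.
        int[] samples = {0x20, 0x1C, 0x1D, 0x1E, 0x1F};
        for (int cp : samples) {
            System.out.printf("U+%04X  default-s=%b  javaWhitespace=%b%n",
                    cp, matchesS(cp), matchesJavaWhitespace(cp));
        }
    }
}
```

Without any flags, the four separator characters should be accepted only by \p{javaWhitespace}, while the plain space is accepted by both.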
\s is the shortest and also the most non-portable option to specify a space character. Although it is rare to port Java code to other languages, it is more about porting the knowledge of the syntax of one regex engine to another. There are many regex engines using Perl-like syntax, so differences in interpretation of the same syntax, such as \s, confuse programmers.
Apart from space (ASCII 32), new line (\n, ASCII 10), horizontal tab (\t, ASCII 9), carriage return (\r, ASCII 13) and form feed (\f, ASCII 12), there is no consensus between different engines on what a space character is.
- Java, POSIX (ASCII): Also includes vertical tab (ASCII 11). Java seems to follow the POSIX standard here.
- JavaScript (Edition 5.1): According to the specs (word for word), apart from the 5 common ones, it includes:
  - Unicode category Zs (Separator/Space), \u2028 (Line Separator), \u2029 (Paragraph Separator). It basically includes all characters under category Z (Separator). Actually \u2028 is the sole member of category Zl (Separator/Line), and \u2029 is the sole member of category Zp (Separator/Paragraph). By the wording, it is possible that the current version of the specs excludes any future extension to those 2 categories.
  - Vertical tab \v
  - Byte-Order Mark a.k.a. ZERO WIDTH NO-BREAK SPACE \ufeff
- Perl, PCRE (ASCII mode): Vertical tab \v added from Perl 5.18 as an experiment. Before 5.18, it only matches the 5 common ones.
- Perl (Unicode mode): Apart from the 5 common ones:
  - Unicode category Z (Separator)
  - Vertical tab \v added from Perl 5.18 as an experiment.
  - NEXT LINE (NEL) \u0085
  - MONGOLIAN VOWEL SEPARATOR \u180e
- .NET (default): Apart from the 5 common ones:
  - Unicode category Z (Separator)
  - Vertical tab \v
  - NEXT LINE (NEL) \u0085
- Java (Unicode): From Java 7, the Pattern class includes a new flag UNICODE_CHARACTER_CLASS which makes predefined character classes and POSIX character classes conform to Unicode Technical Standard #18: Unicode Regular Expression. When the flag is active, a predefined character class and the corresponding POSIX character class become equivalent (match the same thing). The list of characters is the same as .NET's.
That is enough to drive one crazy!
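For the two Java entries in the list, a small sketch along these lines shows how the same \s changes meaning once the flag is set. The class name UnicodeFlagDemo and the four sample code points are assumptions chosen for illustration; Pattern.UNICODE_CHARACTER_CLASS is the real flag and needs Java 7 or later.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample code points are illustrative only.
class UnicodeFlagDemo {
    static final Pattern DEFAULT_S = Pattern.compile("\\s");
    static final Pattern UNICODE_S =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the pattern matches a string made of exactly this code point.
    static boolean matches(Pattern pattern, int codePoint) {
        return pattern.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+000B vertical tab, U+0085 NEXT LINE, U+00A0 NO-BREAK SPACE (category Zs),
        // U+2028 LINE SEPARATOR (category Zl).
        int[] samples = {0x0B, 0x85, 0xA0, 0x2028};
        for (int cp : samples) {
            System.out.printf("U+%04X  default=%b  unicode=%b%n",
                    cp, matches(DEFAULT_S, cp), matches(UNICODE_S, cp));
        }
    }
}
```

In default mode only the vertical tab among these samples should match; with the flag, all four should.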
\p{Space} is the more "stable" option since it follows the POSIX standard in default mode, and Unicode Technical Standard #18: Unicode Regular Expression in UNICODE_CHARACTER_CLASS mode.
If you use a POSIX character class, POSIX-compliant implementations will have the same behavior in ASCII mode, and Unicode regex engines which follow the recommendation will have (almost) the same behavior in Unicode mode.
\s and \p{Space} are equivalent in Java, regardless of the flag. If you use \s in Java, you can be sure you are following some standard/recommendation. It just does not announce this fact to most programmers.
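If you want to convince yourself of the equivalence, a brute-force comparison is one way to do it. This is only a sketch: the class name SpaceEquivalenceDemo and the decision to scan just the Basic Multilingual Plane are my assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the BMP-only scan are illustrative choices.
class SpaceEquivalenceDemo {
    // Counts BMP characters on which \s and \p{Space} disagree under the given flags.
    static int countDisagreements(int flags) {
        Pattern s = Pattern.compile("\\s", flags);
        Pattern space = Pattern.compile("\\p{Space}", flags);
        int disagreements = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (Character.isSurrogate((char) cp)) {
                continue; // lone surrogates are not characters on their own
            }
            String text = String.valueOf((char) cp);
            if (s.matcher(text).matches() != space.matcher(text).matches()) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println("default mode: " + countDisagreements(0));
        System.out.println("unicode mode: "
                + countDisagreements(Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

Both counts should come out as zero if the two classes really match the same characters in each mode.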
Use \p{javaWhitespace} to match whitespace according to Java's definition, that is, Character.isWhitespace(). The name of the function is extremely misleading.
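The Pattern documentation describes \p{javaWhitespace} as equivalent to java.lang.Character.isWhitespace(). The sketch below compares the two over every code point; the class name JavaWhitespaceDefinitionDemo is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
class JavaWhitespaceDefinitionDemo {
    static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");

    // Counts code points where the pattern and Character.isWhitespace disagree.
    static int countDisagreements() {
        int disagreements = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates are not characters on their own
            }
            String text = new String(Character.toChars(cp));
            if (JAVA_WS.matcher(text).matches() != Character.isWhitespace(cp)) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements());
        System.out.println("U+00A0 NO-BREAK SPACE: " + Character.isWhitespace(0x00A0));
        System.out.println("U+001C FILE SEPARATOR: " + Character.isWhitespace(0x001C));
    }
}
```

Note that Character.isWhitespace() rejects NO-BREAK SPACE but accepts FILE SEPARATOR, which is part of why this definition can surprise people.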