Regex to match Egyptian Hieroglyphics [closed]
TL;DR: \p{Egyptian_Hieroglyphs}
JavaScript
Egyptian_Hieroglyphs belong to an "astral" plane, where a character needs more than 16 bits to encode. JavaScript, as of ES5, doesn't support astral planes (more on that), so you have to use surrogate pairs. The first code point in the range encodes as the surrogate pair
U+13000 = d80c dc00
the last one is
U+1342E = d80d dc2e
that gives
// Match runs of hieroglyphs, written as UTF-16 surrogate pairs
var re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g;
var t = document.getElementById("pyramid").innerHTML;
document.write("<h1>Found</h1>" + t.match(re));
<div id="pyramid">
some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮
</div>
This is what it looks like with Noto Sans Egyptian Hieroglyphs installed:
Other languages
On platforms that support UCS-4 you can use the Egyptian code points 13000 to 1342F directly, but the syntax differs from system to system. For example, in Python (3.3 and up) it is [\U00013000-\U0001342E]:
>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']
Finally, if your regex engine supports Unicode properties, you can (and should) use these instead of hardcoded ranges. For example, in PHP/PCRE:
$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮";
preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);
prints
[0] => Array
(
[0] => 𓀀
[1] => 𓀁
[2] => 𓐬
[3] => 𓐭
[4] => 𓐮
)
Unicode encodes Egyptian hieroglyphs in the range U+13000 to U+1342F (beyond the Basic Multilingual Plane).
In this case, there are 2 ways to write the regex:
- By specifying a character range from U+13000 to U+1342F. While specifying a character range in regex for characters in the BMP is as easy as [a-z], doing so for characters in astral planes might not be as simple, depending on the language support.
- By specifying the Unicode block for Egyptian hieroglyphs. Since we are matching any character in the Egyptian hieroglyphs block, this is the preferred way to write the regex where support is available.
Java
(Currently, I have no idea how other implementations of the Java Class Library deal with astral plane characters in the Pattern class.)
Sun/Oracle implementation
I'm not sure if it makes sense to talk about matching characters in astral planes in Java 1.4, since support for characters beyond BMP was only added in Java 5 by retrofitting the existing String implementation (which uses UCS-2 for its internal String representation) with code point-aware methods.
Since Java continues to allow lone surrogates (ones that cannot form a pair with another surrogate) to be specified in a String, the result is a mess: surrogates are not real characters, and lone surrogates are invalid in UTF-16.
The Pattern class saw a major overhaul from Java 1.4.x to Java 5, as the class was rewritten to support matching Unicode characters in astral planes: the pattern string is converted to an array of code points before it is parsed, and the input string is traversed with the code point-aware methods of the String class.
You can read more about the madness in Java regex in this answer by tchrist.
I have written a detailed explanation of how to match a range of characters that involves astral plane characters in this answer, so I am only going to include the code here. That answer also includes a few counter-examples of incorrect attempts to write a regex that matches astral plane characters.
Java 5 (and above)
"[\uD80C\uDC00-\uD80D\uDC2F]"
Java 7 (and above)
"[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
"[\\x{13000}-\\x{1342F}]"
Since we are matching any code point that belongs to the Unicode block, it can also be written as:
"\\p{InEgyptian_Hieroglyphs}"
"\\p{InEgyptian Hieroglyphs}"
"\\p{InEgyptianHieroglyphs}"
"\\p{block=EgyptianHieroglyphs}"
"\\p{blk=Egyptian Hieroglyphs}"
Java has supported the \p syntax for Unicode blocks since 1.4, but support for the Egyptian Hieroglyphs block was only added in Java 7.
PCRE (used in PHP)
A PHP example is already covered in georg's answer:
'~\p{Egyptian_Hieroglyphs}~u'
Note that the u flag is mandatory if you want to match by code points instead of by code units.
I'm not sure whether there is a better post on Stack Overflow, but I have written some explanation of the effect of the u flag (UTF mode) in this answer of mine.
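As a rough analogy for the difference between code units and code points, here is a minimal sketch using Python's re module rather than PCRE. It assumes that a bytes pattern over UTF-8 data is a fair stand-in for PCRE without the u flag, and a str pattern a stand-in for UTF mode; it does not call PCRE itself.

```python
import re

# Two hieroglyphs: U+13000 and U+1342E.
text = "\U00013000\U0001342E"

# str pattern: "." consumes one code point at a time.
code_points = re.findall(r".", text)

# bytes pattern over the UTF-8 encoding: "." consumes one byte,
# i.e. one UTF-8 code unit, at a time.
code_units = re.findall(rb".", text.encode("utf-8"), re.DOTALL)

print(len(code_points))  # 2
print(len(code_units))   # 8 (each hieroglyph takes 4 bytes in UTF-8)
```

A pattern that counts or ranges over "characters" therefore behaves very differently depending on which unit the engine walks over.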
One thing to note is that Egyptian_Hieroglyphs is only available from PCRE 8.02 (or a version not earlier than PCRE 7.90).
As an alternative, you can specify a character range with the \x{h...hh} syntax:
'~[\x{13000}-\x{1342F}]~u'
Note the mandatory u flag.
The \x{h...hh} syntax is supported from at least PCRE 4.50.
JavaScript (ECMAScript)
ES5
The character range method (which is the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex is modified a bit to cover the whole block, including the reserved unassigned code point.
/(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/
The solution above demonstrates the technique for matching a range of characters in the astral planes, and also the limitations of JavaScript RegExp.
JavaScript also suffers from the same string representation problem as Java. While Java fixed the Pattern class in Java 5 to allow it to work with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code points in the regular expression.
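The surrogate pairs in the ES5 pattern above follow mechanically from the UTF-16 encoding rules. As a minimal sketch in Python (the helper names to_surrogates, esc and es5_range are made up for this illustration and are not part of any library), the following derives the pairs and rebuilds the alternation for a range of astral code points:

```python
def to_surrogates(cp):
    """Split an astral code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral code point: %X" % cp)
    offset = cp - 0x10000
    lead = 0xD800 + (offset >> 10)      # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return lead, trail


def esc(unit):
    """Format one code unit as a JavaScript \\uXXXX escape."""
    return "\\u%04X" % unit


def es5_range(lo, hi):
    """Build an ES5-style pattern matching every astral code point in lo..hi."""
    lo_lead, lo_trail = to_surrogates(lo)
    hi_lead, hi_trail = to_surrogates(hi)
    if lo_lead == hi_lead:
        # Whole range shares one lead surrogate.
        return "%s[%s-%s]" % (esc(lo_lead), esc(lo_trail), esc(hi_trail))
    # First lead surrogate: from lo's trail up to the last trail surrogate.
    parts = ["%s[%s-%s]" % (esc(lo_lead), esc(lo_trail), esc(0xDFFF))]
    if hi_lead - lo_lead > 1:
        # Lead surrogates strictly in between accept any trail surrogate.
        parts.append("[%s-%s][%s-%s]" % (
            esc(lo_lead + 1), esc(hi_lead - 1), esc(0xDC00), esc(0xDFFF)))
    # Last lead surrogate: from the first trail surrogate up to hi's trail.
    parts.append("%s[%s-%s]" % (esc(hi_lead), esc(0xDC00), esc(hi_trail)))
    return "(?:" + "|".join(parts) + ")"


print(["%04X" % u for u in to_surrogates(0x13000)])  # ['D80C', 'DC00']
print(["%04X" % u for u in to_surrogates(0x1342F)])  # ['D80D', 'DC2F']
print(es5_range(0x13000, 0x1342F))
# (?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])
```

The sketch only handles ranges that lie entirely in the astral planes; ranges that straddle the BMP boundary would need an extra BMP part.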
ES6
Finally, support for code point matching was added in ECMAScript 6. It is enabled via the u flag, so that existing code written for previous versions of ECMAScript does not break.
- ES6 Specification - 21.2 RegExp (Regular Expression) Objects
- Unicode-aware regular expressions in ECMAScript 6
Check the Support section of the second link above for the list of browsers providing experimental support for ES6 RegExp.
With the introduction of the \u{h...hh} syntax in ES6, the character range can be rewritten in a manner similar to Java 7:
/[\u{13000}-\u{1342F}]/u
Or you can specify the characters directly in the RegExp literal, though the intention is not as clear-cut as with [a-z]:
/[𓀀-𓐯]/u
Note the u modifier in both regexes above.
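To see why the u modifier matters for the literal form, it helps to look at the UTF-16 code units that an engine working on code units would see inside the character class. A small Python sketch (utf16_units is a hypothetical helper written for this illustration):

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# The contents of the character class: U+13000, "-", U+1342F.
units = utf16_units("\U00013000-\U0001342F")
print(" ".join("%04X" % unit for unit in units))  # D80C DC00 002D D80D DC2F

# Read unit by unit, the "range" is formed by the units on either side of the hyphen.
range_start, range_end = units[1], units[3]
print(range_start > range_end)  # True: DC00-D80D is out of order
```

Read one code unit at a time, the class is \uD80C, then the range \uDC00-\uD80D, then \uDC2F. The endpoints of that range are out of order, so without u the literal does not express the intended block.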
Still stuck with ES5? Don't worry, you can transpile ES6 Unicode RegExp to ES5 RegExp with regexpu.