Python Unicode regular expression matching fails with some Unicode characters: bug or mistake?
I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.
However, I am running into some odd problems with Python's regex matching. For instance, consider this name: "किशोरी". This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.
The following returns a match, as it should:
re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)
But this does not:
re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)
Some spelunking revealed that only one character in this string, character 0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on "derived core properties" lists other characters (I have not checked all) in this string as alphabetic ones - as indeed they are.
Is this just a bug in Python's implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?
Solution 1:
It is a bug in the re module, and it is fixed in the regex module:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"

def test(re_):
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)

print([unicodedata.category(cp) for cp in word])
print(" ".join(ch for ch in regex.findall("\\X", word)))
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)
test(re)  # fails
The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:
Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.
here and further emphasis is mine
A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:
Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, ...
Therefore either all codepoints that form a single character are \w or they are all \W.
In this case "किशोरी" matches ^\w{6}$.
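The count of 6 code points versus 3 user-perceived characters can be approximated without the regex module. The sketch below is only a stand-in for \X, under stated assumptions: it attaches every code point whose general category starts with M (a mark) to the preceding base character. That is enough for this word, but it does not implement the full extended grapheme cluster rules of Unicode Text Segmentation. The helper name approx_clusters is hypothetical, and the code is written for Python 3.

```python
import unicodedata

# "किशोरी" written with escapes so copy and paste cannot alter it.
word = "\u0915\u093f\u0936\u094b\u0930\u0940"


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Rough approximation of grapheme clusters, not the full UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


clusters = approx_clusters(word)
print(len(word), "code points")
print(len(clusters), "approximate clusters:", " ".join(clusters))
```

Treating each cluster as one unit is what makes it unreasonable for a word boundary to fall between a consonant and its vowel sign.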
From the docs for \w in Python 2:
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
In Python 3:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
From the regex docs:
Definition of 'word' character (issue #1693050):
The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and \B.
According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
Solution 2:
From Character Map:
ि
U+093F DEVANAGARI VOWEL SIGN I
General Character Properties
In Unicode since: 1.1
Unicode category: Mark, Spacing Combining
So, technically speaking, this is not a letter and doesn't fall under \w even with re.UNICODE. You can try using regex with Unicode character properties instead in order to include these sorts of characters.
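As a rough illustration of matching by Unicode properties, the sketch below uses only the standard library instead of the regex module, so it is a swapped-in technique rather than exactly what is proposed here. It accepts a string when every code point is a letter (L*), a mark (M*), a decimal digit (Nd), an underscore, or whitespace. That set of categories is an assumption about what should count as a word character, and the helper name is_word_text is hypothetical. With the regex module, a character class built from \p{L}, \p{M} and \p{Nd} would play a similar role. The code is written for Python 3.

```python
import unicodedata

# "किशोरी" written with escapes so copy and paste cannot alter it.
word = "\u0915\u093f\u0936\u094b\u0930\u0940"


def is_word_text(text):
    """Return True if text is non-empty and made only of letters, marks,
    decimal digits, underscores, or whitespace (an assumed definition)."""
    if not text:
        return False
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("L", "M") or category == "Nd":
            continue  # letter, combining mark, or decimal digit
        if ch == "_" or ch.isspace():
            continue  # underscore or whitespace
        return False
    return True


print(is_word_text(word))        # vowel signs (category Mc) are accepted
print(is_word_text(word + "!"))  # punctuation is rejected
```

Checking categories directly avoids hand-writing a Devanagari character range, at the cost of accepting marks from every script.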
Solution 3:
I tested the following:
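# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import unicodedata

# Print the general category and the name of every code point in the word.
for c in "किशोरी":
    print(unicodedata.category(c))
    print(unicodedata.name(c))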
which displays in my case:
Lo
DEVANAGARI LETTER KA
Mc
DEVANAGARI VOWEL SIGN I
Lo
DEVANAGARI LETTER SHA
Mc
DEVANAGARI VOWEL SIGN O
Lo
DEVANAGARI LETTER RA
Mc
DEVANAGARI VOWEL SIGN II
Unicode is hard to debug because copy and paste can mess up the data, and I don't know Hindi. In some languages the same character can be encoded in different ways in Unicode. Is it possible that you have to normalize your string somehow before matching? To me it looks OK that a vowel sign is not matched by \w.
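One way to check the normalization question is to compare the word against each of its normalized forms. The sketch below assumes the word is exactly the six code points listed above and is written for Python 3. It suggests that normalization leaves this particular string unchanged, so on its own it would not turn the vowel signs into something \w accepts.

```python
import unicodedata

# "किशोरी" written with escapes so copy and paste cannot alter it.
word = "\u0915\u093f\u0936\u094b\u0930\u0940"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, word)
    categories = [unicodedata.category(ch) for ch in normalized]
    # Print the form, whether the text changed, and the resulting categories.
    print(form, normalized == word, categories)
```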