How to detect word boundary in regex for Arabic words - Python
I am trying to remove any word that contains non-Arabic characters. So, words like ذهb or word should be removed.
I have managed to remove the non-Arabic characters using the below regex:
re.sub(r'([^،-٩]+)',' ', 'ذهb')
But how would I remove the whole word? Preceding the regex with \b doesn't seem to work.
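The behaviour described above can be reproduced with a minimal sketch. The function names are made up for illustration, and the range is written with \u escapes (U+060C to U+0669), which should be the same range as the literal ،-٩ in the question. In Python 3 str patterns, \b only matches between a word character and a non-word character, and both Arabic and Latin letters count as word characters, so there is no boundary between ه and b.

```python
import re

# Same range as in the question: U+060C (Arabic comma) to U+0669 (Arabic-Indic nine).
NON_ARABIC_RUN = r'[^\u060C-\u0669]+'

def strip_non_arabic(text):
    # Replaces each run of characters outside the range with a single space.
    return re.sub(NON_ARABIC_RUN, ' ', text)

def strip_with_boundary(text):
    # Same pattern preceded by \b, as attempted in the question.
    return re.sub(r'\b' + NON_ARABIC_RUN, ' ', text)

print(repr(strip_non_arabic('ذهb')))     # only the Latin letter is replaced
print(repr(strip_with_boundary('ذهb')))  # nothing matches, string is unchanged
```

With \b in front, the only boundaries in ذهb are at the very start (where the next character is Arabic, so the negated class cannot match) and at the very end (where nothing is left to match), which is why the word is left untouched.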
Solution 1:
You can use
re.sub(r'\s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b', '', text)
The \s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b pattern matches:
- \s* - zero or more whitespace characters
- \b - a word boundary
- [\u0621-\u064A]* - zero or more Arabic letters
- [^\W\d_\u0621-\u064A] - any Unicode letter other than an Arabic letter
- [^\W\d_]* - zero or more Unicode letters
- \b - a word boundary
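
As a quick check, the pattern can be wrapped in a small helper and run on a sample string. This is only a sketch: the names MIXED_WORD and remove_non_arabic_words and the sample text are made up for illustration.

```python
import re

# The pattern from the answer: a word made of letters only that
# contains at least one letter outside U+0621-U+064A.
MIXED_WORD = re.compile(
    r'\s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b'
)

def remove_non_arabic_words(text):
    # Drops each matching word together with the whitespace before it.
    return MIXED_WORD.sub('', text)

sample = 'كتاب ذهb word قلم'
print(remove_non_arabic_words(sample))  # → كتاب قلم
```

Because \s* only consumes the whitespace in front of a removed word, a string that starts with a non-Arabic word keeps the space that followed it, so calling .strip() on the result may be useful.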