How to detect word boundary in regex for Arabic words - Python

I am trying to remove any word that might contain non-Arabic characters. So, words like ذهb or word should be removed.

I have managed to remove the non-Arabic characters using the below regex:

re.sub(r'([^،-٩]+)',' ', 'ذهb')

But how would I remove the whole word? Preceding the regex with \b doesn't seem to work.

Solution 1:

You can use

re.sub(r'\s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b', '', text)

The pattern \s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b matches:

  • \s* - zero or more whitespace characters
  • \b - a word boundary
  • [\u0621-\u064A]* - zero or more Arabic letters
  • [^\W\d_\u0621-\u064A] - any Unicode letter other than an Arabic letter (one outside the \u0621-\u064A range)
  • [^\W\d_]* - zero or more Unicode letters, Arabic or not
  • \b - a word boundary
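
As a minimal sketch of how this behaves, the snippet below wraps the pattern in a helper. The function name remove_mixed_words and the sample string are illustrative assumptions, and the final .strip() is an addition of mine: it tidies the whitespace left behind when the removed word is the first one in the string.

```python
import re

# Pattern from the answer above: optional whitespace, then a whole word
# that contains at least one letter outside the Arabic range \u0621-\u064A.
MIXED_WORD = re.compile(r'\s*\b[\u0621-\u064A]*[^\W\d_\u0621-\u064A][^\W\d_]*\b')


def remove_mixed_words(text):
    """Drop every word that contains a non-Arabic letter (hypothetical helper)."""
    # .strip() is not part of the original answer; it removes the space
    # that remains when the first word of the string gets deleted.
    return MIXED_WORD.sub('', text).strip()


sample = 'ذهb ذهب word كتاب'
print(remove_mixed_words(sample))  # → ذهب كتاب
```

The word with a Latin letter mixed in (ذهb) and the purely Latin word are removed together with the whitespace before them, while the two purely Arabic words are kept.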