Detect strings with non-English characters in Python

Solution 1:

You can check whether the string consists only of ASCII characters (the Latin alphabet plus digits, punctuation, and control characters). If it cannot be decoded as ASCII, it contains at least one non-ASCII character, for example a letter from another alphabet or an accented Latin letter.

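Note the comment # -*- coding: ... on the first line of the code below. It must sit at the top of the Python file: Python 2 raises a SyntaxError about encoding if the source contains non-ASCII literals without it. Python 3 assumes UTF-8 source by default, so there it is optional.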

# -*- coding: utf-8 -*-
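def isEnglish(s):
    try:
        # Decoding as ASCII fails on any non-ASCII character
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True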

assert not isEnglish('slabiky, ale liší se podle významu')
assert isEnglish('English')
assert not isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
assert not isEnglish('how about this one : 通 asfަ')
assert isEnglish('?fd4))45s&')

Solution 2:

IMHO this is the simplest solution (str.isascii() requires Python 3.7 or newer):

def isEnglish(s):
  return s.isascii()
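
Since str.isascii() only exists from Python 3.7 on, here is a minimal sketch of a fallback for older interpreters. The name is_ascii is just an illustration, not something from the standard library, and the fallback assumes that "ASCII" means code points below 128.

```python
def is_ascii(s):
    """Return True if every character in s is in the ASCII range."""
    try:
        return s.isascii()  # available on str since Python 3.7
    except AttributeError:
        # Older interpreters: compare each code point against the ASCII limit.
        return all(ord(ch) < 128 for ch in s)

assert is_ascii('English')
assert not is_ascii('slabiky, ale liší se podle významu')
```

Like isascii() itself, this treats the empty string as ASCII.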



Solution 3:

If you work with byte strings in Python 2 (str, not unicode objects), you can strip the punctuation with translate() and check the result with isalnum(), which avoids raising and catching exceptions:

import string

def isEnglish(s):
    return s.translate(None, string.punctuation).isalnum()

print isEnglish('slabiky, ale liší se podle významu')
print isEnglish('English')
print isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
print isEnglish('how about this one : 通 asfަ')
print isEnglish('?fd4))45s&')
print isEnglish('Текст на русском')

> False
> True
> False
> False
> True
> False
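
The code above is Python 2 only: in Python 3, str.translate() takes a single table argument, and isalnum() is also True for non-Latin letters. A rough Python 3 sketch of the same idea follows, assuming Python 3.7+ for isascii(); the name is_english_alnum is made up for illustration. As with the original, whitespace is not alphanumeric, so any string containing a space returns False.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)

def is_english_alnum(s):
    """True if s, minus ASCII punctuation, is non-empty ASCII letters/digits."""
    stripped = s.translate(_STRIP_PUNCTUATION)
    # isalnum() alone would accept e.g. Cyrillic letters, so also require ASCII.
    return stripped.isascii() and stripped.isalnum()

print(is_english_alnum('English'))           # True
print(is_english_alnum('?fd4))45s&'))        # True
print(is_english_alnum('Текст на русском'))  # False
```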

You can also filter non-ASCII characters out of a string with this function (filter() returns a str here only in Python 2):

ascii = set(string.printable)

def remove_non_ascii(s):
    return filter(lambda x: x in ascii, s)

remove_non_ascii('slabiky, ale liší se podle významu')
> slabiky, ale li se podle vznamu
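
In Python 3, filter() returns an iterator rather than a string, so the function above would need a join. A small sketch of that variant is below; the constant name PRINTABLE is my own choice (it also avoids shadowing Python 3's built-in ascii()). Note that string.printable leaves out most ASCII control characters, so those are dropped as well.

```python
import string

# Printable ASCII characters: digits, letters, punctuation, and whitespace.
PRINTABLE = set(string.printable)

def remove_non_ascii(s):
    """Keep only the characters of s that are printable ASCII."""
    return ''.join(ch for ch in s if ch in PRINTABLE)

print(remove_non_ascii('slabiky, ale liší se podle významu'))
# slabiky, ale li se podle vznamu
```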