How do I check if a string is unicode or ascii?
What do I have to do in Python to figure out which encoding a string has?
In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
In Python 2, a string may be of type str or of type unicode. You can tell which with code like this:
    def whatisthis(s):
        if isinstance(s, str):
            print "ordinary string"
        elif isinstance(s, unicode):
            print "unicode string"
        else:
            print "not a string"
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
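If what you actually care about is the content rather than the type, you can test whether every character (or byte) falls in the ASCII range. Here is a minimal Python 3 sketch of that idea; the helper name is_ascii_only is made up for this example and is not a standard function.

```python
def is_ascii_only(value):
    """Return True if a str or bytes value contains only ASCII (0-127)."""
    if isinstance(value, str):
        # Encoding fails if any code point is outside the ASCII range.
        try:
            value.encode("ascii")
        except UnicodeEncodeError:
            return False
        return True
    if isinstance(value, (bytes, bytearray)):
        # Iterating over bytes yields integers in Python 3.
        return all(byte < 0x80 for byte in value)
    raise TypeError("expected str, bytes or bytearray")
```

Note that a True result for a bytestring only says the bytes are in the ASCII range; it does not tell you the data was meant to be text.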
How to tell if an object is a unicode string or a byte string
You can use type or isinstance.
In Python 2:
>>> type(u'abc') # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc') # Python 2 byte string literal
<type 'str'>
In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text.
If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.
In Python 3:
>>> type('abc') # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc') # Python 3 byte string literal
<class 'bytes'>
In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.
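For completeness, the same check with isinstance might look like this in Python 3. This is only a sketch; the function name describe_string_type is hypothetical, and treating bytearray as a byte string is my own choice.

```python
def describe_string_type(obj):
    """Classify an object as text, bytes, or neither (Python 3)."""
    if isinstance(obj, str):
        return "unicode string (str)"
    if isinstance(obj, (bytes, bytearray)):
        return "byte string"
    return "not a string"


print(describe_string_type("abc"))
print(describe_string_type(b"abc"))
print(describe_string_type(123))
```

isinstance also accepts subclasses, which is usually what you want; use type(obj) is str only if you need an exact match.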
How to tell if a byte string is valid utf-8 or ascii
You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.
>>> u_umlaut = b'\xc3\x9c' # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
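If you need this as a yes/no check, you could wrap the decode call in a small helper. A sketch in Python 3, where can_decode is a name I made up for illustration:

```python
def can_decode(data, encoding):
    """Return True if the byte string decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 representation of the letter 'Ü'
print(can_decode(u_umlaut, "utf-8"))
print(can_decode(u_umlaut, "ascii"))
```

Keep in mind that a successful decode only shows the bytes are valid in that encoding, not that it is the encoding the data was written in. For example, latin-1 will decode any byte sequence without error.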