How can I detect if a file is binary (non-text) in Python?
Solution 1:
Yet another method based on file(1) behavior:
>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda data: bool(data.translate(None, textchars))
Example:
>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False
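If you want this as a reusable function rather than a REPL one-liner, a minimal sketch could look like the following. The names `TEXT_CHARS` and `looks_binary` and the 1024-byte default are my own choices for illustration, not part of file(1) or any library; the byte set is the same one used above.

```python
# Bytes the heuristic treats as "text": a few control characters plus
# everything from 0x20 upward except DEL (0x7f).
TEXT_CHARS = bytes({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7f})


def looks_binary(path, blocksize=1024):
    """Return True if the first `blocksize` bytes contain non-text bytes."""
    # The context manager closes the file even if read() raises.
    with open(path, "rb") as f:
        chunk = f.read(blocksize)
    # translate() with table=None deletes every byte listed in TEXT_CHARS;
    # whatever is left over is a byte the heuristic considers binary.
    return bool(chunk.translate(None, TEXT_CHARS))
```

Note that this only inspects the start of the file, and that an empty file counts as text. UTF-16 text will be reported as binary because it usually contains NUL bytes.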
Solution 2:
You can also use the mimetypes module:
import mimetypes
...
mime, encoding = mimetypes.guess_type(file)  # returns a (type, encoding) tuple
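Keep in mind that guess_type() looks only at the file name's extension, not at the file's contents, and returns a (type, encoding) tuple. It's fairly easy to compile a list of binary MIME types. For example, Apache ships with a mime.types file that you could parse into two lists, binary and text, and then check which list the guessed type falls into.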
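As a rough sketch of that idea without parsing mime.types, you could treat anything whose guessed type starts with "text/" as text. That prefix rule is my simplifying assumption, and the function name `guess_is_text` is hypothetical; a real list would also need to cover textual types such as application/json.

```python
import mimetypes


def guess_is_text(filename):
    """Guess from the file name alone; return None when the type is unknown."""
    mime, encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None  # unknown extension, so no opinion either way
    if encoding is not None:
        return False  # e.g. gzip-compressed: the bytes on disk are binary
    # Assumption: only "text/*" types count as text.
    return mime.startswith("text/")
```

Because the guess comes from the extension only, a binary file named notes.txt will still be reported as text.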
Solution 3:
If you're using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 decodes the file to str (Unicode) in text mode and returns bytes in binary mode, so if your encoding can't decode arbitrary byte sequences, a binary file is quite likely to raise UnicodeDecodeError.
Example:
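try:
    # Pass the encoding explicitly; the default may depend on the platform locale.
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            process_line(line)
except UnicodeDecodeError:
    pass  # Found non-text data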
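If you only need a yes/no answer rather than processing lines, the same approach can be wrapped in a small predicate. This is a sketch under the assumption that "text" means "decodes as UTF-8"; the name `is_utf8_text` and the block size are mine.

```python
def is_utf8_text(path, blocksize=65536):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    with open(path, "r", encoding="utf-8") as f:
        try:
            # Read in blocks so large files are not loaded into memory at once.
            while f.read(blocksize):
                pass
        except UnicodeDecodeError:
            return False
    return True
```

This reads the entire file in the worst case, and it will accept any file that happens to be valid UTF-8, including one containing NUL bytes.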
Solution 4:
If it helps, many binary file types begin with a magic number. Here is a list of file signatures.
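A minimal sketch of a magic-number check is below. The table holds only a handful of well-known signatures that I picked for illustration and is nowhere near complete; `SIGNATURES` and `sniff_signature` are hypothetical names.

```python
# A few well-known leading byte sequences (illustrative, not exhaustive).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\x1f\x8b": "gzip data",
}


def sniff_signature(path):
    """Return a description if the file starts with a known signature, else None."""
    longest = max(len(sig) for sig in SIGNATURES)
    with open(path, "rb") as f:
        head = f.read(longest)  # only the first few bytes are needed
    for sig, name in SIGNATURES.items():
        if head.startswith(sig):
            return name
    return None
```

A None result only means the file did not match this short table, not that it is text, so this works best combined with one of the other checks.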