Determining binary/text file type in Java?
There's no guaranteed way, but here are a couple of possibilities:
-
Look for a header on the file. Unfortunately, headers are file-specific, so while you might be able to find out that it's a RAR file, you won't get the more generic answer of whether it's text or binary.
-
Count the number of character vs. non-character types. Text files will be mostly alphabetical characters while binary files - especially compressed ones like rar, zip, and such - will tend to have bytes more evenly represented.
-
Look for a regularly repeating pattern of newlines.
Using Java 7 Files class http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
boolean isBinaryFile(File f) throws IOException {
String type = Files.probeContentType(f.toPath());
if (type == null) {
//type couldn't be determined, assume binary
return true;
} else if (type.startsWith("text")) {
return false;
} else {
//type isn't text
return true;
}
}
Run file -bi {filename}
. If whatever it returns starts with 'text/', then it's non-binary, otherwise it is. ;-)
I made this one. A bit simpler, but for latin-based languages, it should work fine, with the ratio adjustment.
/**
* Guess whether given file is binary. Just checks for anything under 0x09.
*/
public static boolean isBinaryFile(File f) throws FileNotFoundException, IOException {
FileInputStream in = new FileInputStream(f);
int size = in.available();
if(size > 1024) size = 1024;
byte[] data = new byte[size];
in.read(data);
in.close();
int ascii = 0;
int other = 0;
for(int i = 0; i < data.length; i++) {
byte b = data[i];
if( b < 0x09 ) return true;
if( b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D ) ascii++;
else if( b >= 0x20 && b <= 0x7E ) ascii++;
else other++;
}
if( other == 0 ) return false;
return 100 * other / (ascii + other) > 95;
}
Have a look at the JMimeMagic library.
jMimeMagic is a Java library for determining the MIME type of files or streams.