How to reliably detect file types? [duplicate]
Objective: given a file, determine whether it is of a given type (XML, JSON, Properties, etc.).
Consider the case of XML. Until we ran into this issue, the following approach worked fine:
try {
saxReader.read(f);
} catch (DocumentException e) {
logger.warn(" - File is not XML: " + e.getMessage());
return false;
}
return true;
As expected, when the XML is well formed, the test passes and the method returns true. If the file can't be parsed, false is returned.
This breaks, however, when we deal with a malformed file that is still XML: parsing fails, so the file is wrongly reported as not being XML.
I'd rather not rely on the .xml extension (fails all the time), on looking for the <?xml version="1.0" encoding="UTF-8"?> string inside the file, etc.
Is there another way this can be handled?
What would you have to see inside the file to "suspect it may be XML even though a DocumentException was caught"? This is needed for parsing purposes.
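To make the question concrete, here is a minimal sketch of the kind of "suspicion" check meant above, to be run only after the parser has already failed. It is a heuristic, not a validator: it assumes a UTF-8 or ASCII-compatible encoding (UTF-16 files would be missed), and the class name XmlSniffer and the 512-byte window are hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class XmlSniffer {

    /** Reads up to the first 512 bytes of the file and applies the text heuristic. */
    static boolean looksLikeXml(Path file) throws IOException {
        byte[] buffer = new byte[512];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // Keep reading until the buffer is full or the stream ends
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) != -1) {
                total += n;
            }
        }
        return looksLikeXml(new String(buffer, 0, total, StandardCharsets.UTF_8));
    }

    /**
     * Returns true if the first non-whitespace character is '<' and is followed
     * by something that can start a declaration, comment, DOCTYPE, or element name.
     */
    static boolean looksLikeXml(String head) {
        int i = 0;
        // Skip a byte order mark if one was decoded at the start
        if (head.startsWith("\uFEFF")) {
            i = 1;
        }
        while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
            i++;
        }
        if (i + 1 >= head.length() || head.charAt(i) != '<') {
            return false;
        }
        char next = head.charAt(i + 1);
        return next == '?' || next == '!' || next == '_' || next == ':'
                || Character.isLetter(next);
    }
}
```

In the catch block from the snippet above, one could then log "probably malformed XML" when XmlSniffer.looksLikeXml(f.toPath()) returns true instead of returning false outright. Some HTML and other angle-bracket formats would also pass this check, so it only narrows things down.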
File type detection tools:
- Mime Type Detection Utility
- DROID (Digital Record Object Identification)
- ftc - File Type Classifier
- JHOVE, JHOVE2
- NLNZ Metadata Extraction Tool
- Apache Tika
- TrID, TrIDNet
- Oracle Outside In (commercial)
- Forensic Innovations File Investigator TOOLS (commercial)
Apache Tika gives me the fewest issues and, unlike Java 7's Files.probeContentType, is not platform-specific:
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
File inputFile = ...
String type = new Tika().detect(inputFile);
System.out.println(type);
For an XML file I got 'application/xml'; for a properties file I got 'text/plain'.
You can, however, pass your own Detector to the Tika constructor, e.g. new Tika(detector). The Maven dependency:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.xx</version>
</dependency>
For those who do not need very precise detection, there is Java 7's Files.probeContentType method (mentioned by rjdkolb):
Path filePath = Paths.get("/path/to/your/file.jpg");
String contentType = Files.probeContentType(filePath);
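Note that Files.probeContentType is documented to return null when the type cannot be determined, and it can throw IOException. A small wrapper along these lines may make it easier to use; the class name ContentTypeProbe and the choice of "application/octet-stream" as the fallback are assumptions made for this sketch, not part of the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ContentTypeProbe {

    // Hypothetical fallback used when the platform cannot determine a type
    static final String UNKNOWN = "application/octet-stream";

    /** Returns the probed content type, or UNKNOWN if probing fails or yields null. */
    static String probeOrDefault(Path file) {
        try {
            String type = Files.probeContentType(file);
            return type != null ? type : UNKNOWN;
        } catch (IOException e) {
            // Treat I/O problems the same as "could not determine"
            return UNKNOWN;
        }
    }
}
```

Since the result depends on the installed file type detectors of the platform, the same file may be reported differently on different systems.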