Using Unicode characters for file names inside a zip archive

I am zipping a file whose name contains special characters, such as Péréquation LES HOPITAUX NEUFS.xls, into a different folder, say temp.

I am able to zip the file, but the file name inside the archive automatically changes to P+¬r+¬quation LES HOPITAUX NEUFS.xls.

How can I support Unicode characters in file names inside a zip archive?


Solution 1:

It depends a little on what code you're using to create the archive. The old Java compression classes are not as flexible as you need.

You may use Apache Commons Compress. Michael Simons wrote this nice piece of code:

ZipArchiveOutputStream ostream = ...; // Your initialization code here
ostream.setEncoding("Cp437"); // This should handle your "special" characters
ostream.setFallbackToUTF8(true); // For "unknown" characters!
ostream.setUseLanguageEncodingFlag(true);
ostream.setCreateUnicodeExtraFields(
    ZipArchiveOutputStream.UnicodeExtraFieldPolicy.NOT_ENCODEABLE);

If you're using Java 7 or later, you finally have a Charset parameter (which can be UTF-8) on the ZipOutputStream constructor.
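As a minimal sketch of that Java 7 constructor (the class and method names here are made up for illustration, and the archive is kept in memory only to make the example self-contained), the following writes one entry with a UTF-8 encoded name and reads the name back:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of any library.
public class Utf8ZipNames {

    // Writes a single entry; the entry name is stored as UTF-8 bytes.
    static byte[] zipOneEntry(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Returns the name of the first entry, decoded as UTF-8 (null if the archive is empty).
    static String firstEntryName(byte[] archive) throws IOException {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive), StandardCharsets.UTF_8)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        // \u00e9 is "é"; escapes avoid depending on the source file encoding.
        String name = "P\u00e9r\u00e9quation LES HOPITAUX NEUFS.xls";
        byte[] archive = zipOneEntry(name, new byte[] {1, 2, 3});
        System.out.println(name.equals(firstEntryName(archive)));
    }
}
```

This only shows that Java reads back what Java wrote. Whether another unzip tool displays the name correctly still depends on that tool understanding UTF-8 names.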

The big problem, anyway, is that many implementations don't understand Unicode names, because the original ZIP file format predates Unicode and for a long time there was no official standard for Unicode file names. See this post for further details.

Solution 2:

Historically, the ZIP specification did not specify which character encoding to use for embedded file names and comments; the original IBM PC character set, commonly referred to as IBM Code Page 437, was supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.

The consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to java.util.jar/zip based tools, and vice versa, if a file name contains characters that are not compatible between Cp437 (alternatively, tools might simply use the default platform encoding) and UTF-8.

For most Europeans, you're "lucky" :-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most characters are simply out of luck. This is why bug 4244499 was No. 1 on the Top 25 Java Bugs list for so many years. The bug is no longer on the list :-) It was finally "fixed" in OpenJDK 7, b57. I still keep a snapshot as a record/kudo for myself :-)
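The mismatch can be reproduced without any archive at all. The sketch below (class and method names are illustrative only) encodes a name as UTF-8, as java.util.zip would, and decodes the bytes as Cp437, as a traditional tool might. It assumes the runtime ships the IBM437 charset, which full JDK builds normally do:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class NameMojibake {

    // Encodes with the writer's charset and decodes with the reader's,
    // which is what happens when the two tools disagree on the encoding.
    static String misread(String name, Charset writer, Charset reader) {
        return new String(name.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        // Assumption: IBM437 is available in this runtime.
        Charset cp437 = Charset.forName("IBM437");
        // \u00e9 is "é"; in UTF-8 it is the two bytes 0xC3 0xA9.
        String garbled = misread("P\u00e9r\u00e9quation", StandardCharsets.UTF_8, cp437);
        System.out.println(garbled);
    }
}
```

Each "é" turns into two Cp437 characters (a box-drawing character followed by a reversed not sign), which looks consistent with the "+¬" pairs in the question once a display substitutes approximations for them.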

The solution (I would call it a "solution" rather than a "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" as a parameter, as shown below.

ZipFile(File, Charset)

ZipInputStream(InputStream, Charset)

ZipOutputStream(OutputStream, Charset)

With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create ZIP files encoded in non-UTF-8 via the new ZipOutputStream(os, charset) constructor, if necessary.
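A rough sketch of both directions, under the assumption that the IBM437 charset is available in the runtime: it creates a Cp437-encoded archive with ZipOutputStream(OutputStream, Charset), standing in for a file produced by a traditional tool, and lists it with ZipFile(File, Charset). The class and method names are invented for the example.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of any library.
public class LegacyZipNames {

    // Creates an archive with one empty entry whose name is encoded in the given charset.
    static void writeZip(File target, Charset charset, String entryName) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(target), charset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.closeEntry();
        }
    }

    // Lists entry names, decoding them with the given charset.
    static List<String> listNames(File archive, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive, charset)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: IBM437 is available in this runtime.
        Charset cp437 = Charset.forName("IBM437");
        File archive = File.createTempFile("legacy", ".zip");
        archive.deleteOnExit();
        writeZip(archive, cp437, "P\u00e9r\u00e9quation.xls");
        System.out.println(listNames(archive, cp437));
    }
}
```

The reader has to be told the same charset the writer used; nothing in such an archive records that the names are Cp437, so picking the charset remains the application's job.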

zip is a stripped-down version of the Jar tool with a "-encoding" option to support non-UTF-8 encodings for entry names and comments. It can serve as a demo of how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...