How can I create a zip / tgz in Linux such that Windows has proper filenames?

Currently, tar encodes filenames in UTF

Actually tar doesn't encode/decode filenames at all, It simply copies them out of the filesystem as-is. If your locale is UTF-8-based (as in many modern Linux distros), that'll be UTF-8. Unfortunately the system codepage of a Windows box is never UTF-8, so the names will always be mangled except on tools such as WinRAR that allow the charset used to be changed.

So it is impossible to create a ZIP file with non-ASCII filenames that work across different countries' releases of Windows and their built-in compressed folder support.

It is a shortcoming of the tar and zip formats that there is no fixed or supplied encoding information, so non-ASCII characters will always been non-portable. If you need a non-ASCII archive format you'll have to use one of the newer formats, such as recent 7z or rar. Unfortunately these are still wonky; in 7zip you need the -mcu switch, and rar still won't use UTF-8 unless it detects characters not in the codepage.

Basically it's a horrible mess and if you can avoid distributing archives containing filenames with non-ASCII characters you'll be much better off.


Here is a simple Python script that I've written to unpack tar files from UNIX on Windows:

import tarfile

archive_name = "archive_name.tar"

def recover(name):
    return unicode(name, 'utf-8')

tar = tarfile.open(name=archive_name, mode='r', bufsize=16*1024)
updated = []
for m in tar.getmembers():
    m.name = recover(m.name)
    updated.append(m)

tar.extractall(members=updated)
tar.close()