Opening zipfile of unsupported compression-type silently returns empty filestream, instead of throwing exception

The cause is the combination of:

  • this file's compression type is type 9: Deflate64/Enhanced Deflate (PKWare's proprietary format, as opposed to the more common type 8)
  • and a zipfile bug: it will not throw an exception for unsupported compression-types. It used to just silently return a bad file object [Section 4.4.5 compression method]. Aargh. How bogus. UPDATE: I filed bug 14313 and it was fixed back in 2012 so it now raises NotImplementedError when the compression type is unknown.

A command-line Workaround is to unzip, then rezip, to get a plain type 8: Deflated.

zipfile will throw an exception in 2.7 , 3.2+ I guess zipfile will never be able to actually handle type 9, for legal reasons. The Python doc makes no mention whatsoever that zipfile cannot handle other compression types :(


My solution for handling compression types that aren't supported by Python's ZipFile was to rely on a call to 7zip when ZipFile.extractall fails.

from zipfile import ZipFile
import subprocess, sys

def Unzip(zipFile, destinationDirectory):
    try:
        with ZipFile(zipFile, 'r') as zipObj:
            # Extract all the contents of zip file in different directory
            zipObj.extractall(destinationDirectory)
    except:
        print("An exception occurred extracting with Python ZipFile library.")
        print("Attempting to extract using 7zip")
        subprocess.Popen(["7z", "e", f"{zipFile}", f"-o{destinationDirectory}", "-y"])

Compression type 9 is Deflate64/Enhanced Deflate, which Python's zipfile module doesn't support (essentially since zlib doesn't support Deflate64, which zipfile delegates to).

And if smaller files work fine, I suspect this zipfile was created by Windows Explorer: for larger files Windows Explorer can decided to use Deflate64.

(Note that Zip64 is different to Deflate64. Zip64 is supported by Python's zipfile module, and just makes a few changes to how some metadata is stored in the zipfile, but still uses regular Deflate for the compressed data.)

However, stream-unzip now supports Deflate64. Modifying its example to read from the local disk, and to read a CSV file as in your example:

import csv
from io import IOBase, TextIOWrapper
import os

from stream_unzip import stream_unzip

def get_zipped_chunks(zip_pathname):
    with open(zip_pathname, 'rb') as f:
       while True:
           chunk = f.read(65536)
           if not chunk:
               break
           yield chunk

def get_unzipped_chunks(zipped_chunks, filename)
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        if file_name != filename:
            for chunk in unzipped_chunks:
                pass
            continue
        yield from unzipped_chunks

def to_str_lines(iterable):
    # Based on the answer at https://stackoverflow.com/a/70639580/1319998
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj(IOBase):
        def readable(self):
            return True
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    yield from TextIOWrapper(FileLikeObj(), encoding='utf-8', newline='')

zipped_chunks = get_zipped_chunks(os.path.join('/my/data/path/.../', 'train.zip'))
unzipped_chunks = get_unzipped_chunks(zipped_chunks, b'train.csv')
str_lines = to_str_lines(unzipped_chunks)
csv_reader = csv.reader(str_lines)

for row in csv_reader:
    print(row)