Get file names inside a ZIP file on an FTP server without downloading the whole archive

You can implement a file-like object that reads data from the FTP server instead of from a local file, and pass that object to the ZipFile constructor in place of a (local) file name.

A trivial implementation might look like this:

from ftplib import FTP
from ssl import SSLSocket

class FtpFile:

    def __init__(self, ftp, name):
        self.ftp = ftp
        self.name = name
        self.size = ftp.size(name)
        self.pos = 0
    
    def seek(self, offset, whence=0):
        # whence: 0 = from start, 1 = from current position, 2 = from end
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        elif whence == 2:
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=None):
        if size is None or size < 0:
            size = self.size - self.pos
        data = b""

        # Based on FTP.retrbinary
        # (but allows stopping after a certain number of bytes has been read).
        # An alternative implementation is at
        # https://stackoverflow.com/q/58819210/850848#58819362
        self.ftp.voidcmd('TYPE I')
        cmd = "RETR {}".format(self.name)
        # Pass the current position as the restart offset of the transfer
        conn = self.ftp.transfercmd(cmd, self.pos)
        try:
            while len(data) < size:
                buf = conn.recv(min(size - len(data), 8192))
                if not buf:
                    break
                data += buf
            # shutdown ssl layer (can be removed if not using TLS/SSL)
            if SSLSocket is not None and isinstance(conn, SSLSocket):
                conn.unwrap()
        finally:
            conn.close()
        try:
            # The server may report an error because the transfer was cut short
            self.ftp.voidresp()
        except Exception:
            pass
        self.pos += len(data)
        return data

You can then use it like this:

import zipfile

ftp = FTP(host, user, passwd)
ftp.cwd(path)

ftpfile = FtpFile(ftp, "archive.zip")
archive = zipfile.ZipFile(ftpfile)
print(archive.namelist())

The above implementation is rather trivial and inefficient. It starts numerous (at least three) downloads of small chunks of data to retrieve the list of contained files. It can be optimized by reading and caching larger chunks, but it should give you the idea.

In particular, you can make use of the fact that you are only going to read the listing. The listing is located at the end of a ZIP archive. So you can download just the last 10 KB or so of data up front, and you will be able to fulfill all read calls from that cache.
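The caching idea could be sketched roughly as follows. This is only an illustration under stated assumptions: TailCachedFile and fetch_from are hypothetical names that do not appear in the code above, and fetch_from(offset) stands for anything that returns the remote file's bytes from offset to the end (over FTP, a RETR started with a REST offset). An in-memory archive stands in for the server so the sketch runs offline.

```python
import io
import zipfile


class TailCachedFile:
    """Read-only file-like object that answers reads from a cached tail.

    fetch_from(offset) is a hypothetical callable returning the bytes of the
    remote file from offset to its end (over FTP: RETR with a REST offset).
    """

    def __init__(self, size, fetch_from, tail_size=10 * 1024):
        self.size = size
        self.pos = 0
        self.cache_start = max(0, size - tail_size)
        self.cache = fetch_from(self.cache_start)  # the only download

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        elif whence == 2:
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=-1):
        if self.pos < self.cache_start:
            raise OSError("read before the cached tail; use a larger tail_size")
        if size is None or size < 0:
            size = self.size - self.pos
        start = self.pos - self.cache_start
        data = self.cache[start:start + size]
        self.pos += len(data)
        return data


# Offline demonstration: an in-memory archive stands in for the FTP server.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(3):
        zf.writestr("file{}.txt".format(i), b"x" * 20000)
blob = buf.getvalue()

fetches = []


def fetch_from(offset):
    fetches.append(offset)
    return blob[offset:]


names = zipfile.ZipFile(TailCachedFile(len(blob), fetch_from)).namelist()
print(names, "downloads:", len(fetches))
```

With a real server, fetch_from could collect the output of ftp.retrbinary("RETR " + name, ..., rest=offset) into a buffer. Positions stay those of the full archive, so only the listing can be read this way; reading a member's contents would fall outside the cache.
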

Knowing that, you can actually do a small hack. As the listing is at the end of the archive, you can download only the end of the archive. While the downloaded ZIP will be broken, it can still be listed. This way, you won't need the FtpFile class. You can even download the listing to memory (BytesIO).

from io import BytesIO
import zipfile

zipdata = BytesIO()
name = "archive.zip"
size = ftp.size(name)
# Download only the last 10 KB of the archive (or all of it, if it is smaller)
ftp.retrbinary("RETR " + name, zipdata.write, rest=max(0, size - 10*1024))

archive = zipfile.ZipFile(zipdata)

print(archive.namelist())

If you get a BadZipFile exception because 10 KB is too small to contain the whole listing, you can retry with a larger chunk.
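The retry could be automated along these lines. Again, this is a sketch under assumptions rather than a definitive implementation: list_names_from_tail and fetch_from are hypothetical names, fetch_from(offset) is assumed to return the file's bytes from offset to the end, and doubling the chunk size is just one possible growth strategy. The demonstration uses an in-memory archive instead of an FTP connection.

```python
import io
import zipfile


def list_names_from_tail(size, fetch_from, chunk=10 * 1024):
    """Return the archive's file names, growing the tail until it parses.

    size is the remote file size; fetch_from(offset) is a hypothetical
    callable returning the file's bytes from offset to the end.
    """
    while True:
        start = max(0, size - chunk)
        tail = io.BytesIO(fetch_from(start))
        try:
            with zipfile.ZipFile(tail) as archive:
                return archive.namelist()
        except (zipfile.BadZipFile, ValueError):
            # ValueError is caught as a precaution: some Python versions may
            # fail with a negative seek instead of raising BadZipFile.
            if start == 0:
                raise  # the whole file was read, so it really is broken
            chunk *= 2


# Offline demonstration with an in-memory archive of many small files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(200):
        zf.writestr("dir/some_long_file_name_{:03d}.txt".format(i), b"data")
blob = buf.getvalue()

offsets = []


def fetch_from(offset):
    offsets.append(offset)
    return blob[offset:]


names = list_names_from_tail(len(blob), fetch_from, chunk=64)
print(len(names), "names after", len(offsets), "attempts")
```

Each failed attempt downloads its tail again from scratch, so it pays to start with a chunk size that is usually large enough.
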
