Is there a way to download parts of the content of a zip file?

If there is a big zip file uploaded on a server and all you need is some of its content, is there a way to open it and choose what you want to download?


Solution 1:

I wrote a Python script list_remote_zip.py that can list files in a zip file that is accessible over HTTP:

import urllib2, struct, sys

# Open the remote file starting at the given byte offset
# (Python 2; needs a server that supports the Range header).
def open_remote_zip(url, offset=0):
    return urllib2.urlopen(urllib2.Request(url, headers={'Range': 'bytes={}-'.format(offset)}))

offset = 0
zipfile = open_remote_zip(sys.argv[1])
header = zipfile.read(30)

# Walk the archive entry by entry as long as a local file header signature is found.
while header[:4] == 'PK\x03\x04':
    compressed_len, uncompressed_len = struct.unpack('<II', header[18:26])
    filename_len, extra_len = struct.unpack('<HH', header[26:30])
    header_len = 30 + filename_len + extra_len
    total_len = header_len + compressed_len

    print('{}\n offset: {}\n length: {}\n  header: {}\n  payload: {}\n uncompressed length: {}'.format(
        zipfile.read(filename_len), offset, total_len, header_len, compressed_len, uncompressed_len))
    zipfile.close()

    # Skip the payload by opening a fresh ranged request at the next header.
    offset += total_len
    zipfile = open_remote_zip(sys.argv[1], offset)
    header = zipfile.read(30)

zipfile.close()

It does not use the zip file's central directory, which is near the end of the file. Instead, it starts at the beginning, parses each local file header, and skips over its payload, hoping to land on the next header. It sends a new request every time it needs to skip to an offset. This of course only works with servers that support the Range HTTP header.
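Whether a given server honours Range requests can be checked in advance with a one-byte ranged request. This is not part of the script, just a quick Python 3 probe (the URL is the example archive used below); a 206 Partial Content reply with a Content-Range header means ranges work, while a plain 200 means the header was ignored:

import urllib.request

url = 'http://dl.xonotic.org/xonotic-0.8.1.zip'
# Ask for a single byte; a Range-aware server answers 206 Partial Content.
req = urllib.request.Request(url, headers={'Range': 'bytes=0-0'})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get('Content-Range'))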

The script only needs to be passed the URL of the zip file as a command-line argument. Example usage and output should look something like this:

$ python list_remote_zip.py http://dl.xonotic.org/xonotic-0.8.1.zip
Xonotic/Makefile
 offset: 0
 length: 1074
  header: 46
  payload: 1028
 uncompressed length: 5019
Xonotic/source/darkplaces/
 offset: 1074
 length: 56
  header: 56
  payload: 0
 uncompressed length: 0
Xonotic/source/darkplaces/bih.h
 offset: 1130
 length: 1166
  header: 61
  payload: 1105
 uncompressed length: 2508
Xonotic/source/darkplaces/portals.h
 offset: 2296
 length: 334
  header: 65
  payload: 269
 uncompressed length: 648
...

To download one of the files, I wrote an even uglier get_file_from_remote_zip.sh bash script around it that uses wget:

# Look up the entry's offset and lengths in the listing.
info=$(python list_remote_zip.py "$1" | grep -m 1 -A 5 "^$2\$" | tail -n +2)
tmpfile=$(mktemp)

# Download just the slice of the archive that holds this entry (local header + payload).
wget --start-pos $(echo "$info" | grep offset | grep -o '[[:digit:]]*') -O - "$1" | head -c $(echo "$info" | grep -m 1 length | grep -o '[[:digit:]]*') >"$tmpfile"

# Emit a gzip header, the raw deflate payload, and a gzip trailer built from the zip header fields.
printf '\x1f\x8b' # gzip magic
tail -c +9 <"$tmpfile" | head -c 1 # copy compression method
printf '\0\0\0\0\0\0\x03' # some flags and mtime
tail -c "+$(expr 1 + $(echo "$info" | grep header | grep -o '[[:digit:]]*'))" <"$tmpfile" # compressed payload
tail -c +15 <"$tmpfile" | head -c 4 # The CRCs seem to be compatible.
tail -c +23 <"$tmpfile" | head -c 4 # uncompressed size (gzip ISIZE field)

rm "$tmpfile"

It takes two arguments: the URL of the zip file and the name of the file to be extracted. The name has to be complete and written exactly as it appears in the output of the list_remote_zip.py script above, which the bash script runs to get the file's offset and lengths. It then uses wget to download the archive starting at the right offset and cuts the download off at the right length. This zip "slice" is saved to a temporary file, which is then used to produce a gzip-formatted stream that can be piped to gzip and decompressed. The slice itself is not a valid zip file because it has no central directory at the end. That could be fixed with zip's -FF option, but I decided instead to change the headers a little and convert it to a gzip file: both (PK)zip and gzip use the same deflate compression algorithm, and even the CRC-32 checksums seem to be compatible.

Here is an example of how to download a random file from Xonotic's archive available at http://dl.xonotic.org/xonotic-0.8.1.zip, decompress it and save it to a local file:

bash get_file_from_remote_zip.sh http://dl.xonotic.org/xonotic-0.8.1.zip Xonotic/source/darkplaces/mprogdefs.h | gzip -d >mprogdefs.h
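If you would rather avoid the gzip-header juggling, the slice can also be inflated directly in Python: zip's deflated entries are raw deflate streams, which zlib can decompress when given a negative window size. This is only a sketch (Python 3; the offset and length are the ones list_remote_zip.py printed above for Xonotic/source/darkplaces/portals.h, and it assumes the entry is deflate-compressed rather than stored):

import struct
import urllib.request
import zlib

url = 'http://dl.xonotic.org/xonotic-0.8.1.zip'
offset, length = 2296, 334  # values reported by list_remote_zip.py for portals.h

# Fetch only the bytes belonging to this entry: local header plus compressed payload.
req = urllib.request.Request(
    url, headers={'Range': 'bytes={}-{}'.format(offset, offset + length - 1)})
entry = urllib.request.urlopen(req).read()

# Skip the 30-byte local header plus the variable-length filename and extra field.
filename_len, extra_len = struct.unpack('<HH', entry[26:30])
payload = entry[30 + filename_len + extra_len:]

# wbits=-15 tells zlib the stream is raw deflate, with no zlib or gzip wrapper.
with open('portals.h', 'wb') as out:
    out.write(zlib.decompress(payload, -15))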

Solution 2:

If you are accessing a file server and have WinRAR (and probably other similar applications) installed, you can open the .zip and drag out the files you want.

If you are talking about a web server, I don't think you can.

Solution 3:

Assuming the server supports resumed downloads, it would in theory be possible to write a client that does this: grab a big enough block near the end of the file to get the directory, use that to figure out which bytes hold the data you actually want, then simply start downloading at that position and stop when you have enough data. It's been so long since I was poking around the format that I don't recall if there's a means of finding the start of the directory other than brute force.
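For what it's worth, the directory is not hard to locate: the End of Central Directory record sits at the very end of the archive (followed only by an optional comment) and stores the central directory's offset and size, and Python's standard zipfile module already knows how to parse all of it. Here is a rough Python 3 sketch of such a client (HttpFile is just a name I made up; it assumes the server answers HEAD requests with a Content-Length header and honours Range requests) that hands zipfile a seekable, HTTP-backed file object:

import io
import urllib.request
import zipfile

class HttpFile(io.RawIOBase):
    """A read-only, seekable file backed by HTTP Range requests."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # Assumes the server reports the total size in response to a HEAD request.
        head = urllib.request.Request(url, method='HEAD')
        with urllib.request.urlopen(head) as resp:
            self.size = int(resp.headers['Content-Length'])

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        return self.pos

    def read(self, n=-1):
        end = self.size if n < 0 else min(self.pos + n, self.size)
        if self.pos >= end:
            return b''
        req = urllib.request.Request(
            self.url, headers={'Range': 'bytes={}-{}'.format(self.pos, end - 1)})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.pos += len(data)
        return data

# zipfile finds the End of Central Directory record at the tail of the archive,
# reads the central directory, then seeks straight to the requested member.
zf = zipfile.ZipFile(HttpFile('http://dl.xonotic.org/xonotic-0.8.1.zip'))
print('\n'.join(zf.namelist()[:5]))
with zf.open('Xonotic/source/darkplaces/mprogdefs.h') as member, \
        open('mprogdefs.h', 'wb') as out:
    out.write(member.read())

Every read() becomes one ranged GET, so listing the archive costs a few small requests and extracting a member costs a few more.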

I've never heard of such a client, and I can't imagine why one would be developed: if it's data that would reasonably be downloaded in pieces, why is the webmaster storing it as one big zip file?