python: read lines from compressed text files

Solution 1:

Using gzip.open in text mode:

import gzip

with gzip.open('input.gz','rt') as f:
    for line in f:
        print('got line', line)

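Note: for binary modes, gzip.open(filename, mode) returns a gzip.GzipFile(filename, mode). For text modes such as 'rt', it additionally wraps the GzipFile in an io.TextIOWrapper, which decodes the bytes to str. I prefer gzip.open, as it looks similar to the with open(...) as f: pattern used for opening uncompressed files.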
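As a minimal sketch of the text-mode pattern, the following writes a small gzip file and reads it back line by line. The file name sample.txt.gz, the helper names write_sample and read_lines, and the explicit encoding='utf-8' are assumptions made for illustration only. Lines read in text mode keep their trailing newline, so the sketch strips it.

```python
import gzip
import os
import tempfile


def write_sample(path, lines, encoding="utf-8"):
    """Write the given lines to a gzip-compressed text file (hypothetical helper)."""
    with gzip.open(path, "wt", encoding=encoding) as f:
        for line in lines:
            f.write(line + "\n")


def read_lines(path, encoding="utf-8"):
    """Yield decoded lines, without the trailing newline, from a gzip file."""
    with gzip.open(path, "rt", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Throwaway file in a temporary directory, used only for this demo.
    path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
    write_sample(path, ["first line", "second line"])
    for line in read_lines(path):
        print("got line", line)
```

Because read_lines is a generator, the file is decompressed incrementally and never has to fit in memory at once.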

Solution 2:

You can use the gzip module from the Python standard library. Just use:

gzip.open('myfile.gz')

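to open the file like any other file and read its lines. Note that the default mode is 'rb', so each line comes back as bytes; pass 'rt' as the mode to get str instead.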
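The difference between the default binary mode and text mode can be sketched as follows. The demo file is created in a temporary directory, and its contents and the UTF-8 encoding are assumptions for illustration.

```python
import gzip
import os
import tempfile

# Throwaway demo file; the name and contents are made up for this sketch.
path = os.path.join(tempfile.mkdtemp(), "myfile.gz")
with gzip.open(path, "wb") as f:
    f.write(b"alpha\nbeta\n")

# With no mode argument, gzip.open uses 'rb': each line is a bytes object.
with gzip.open(path) as f:
    binary_lines = [line for line in f]

# Either decode each line manually...
decoded_lines = [line.decode("utf-8") for line in binary_lines]

# ...or open in text mode ('rt') and let gzip.open do the decoding.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text_lines = [line for line in f]

print(binary_lines)  # [b'alpha\n', b'beta\n']
print(text_lines)    # ['alpha\n', 'beta\n']
```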

More information here: Python gzip module

Solution 3:

Have you tried using gzip.GzipFile? Its arguments are similar to those of open, although it only supports binary modes, so lines are returned as bytes.
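If you want str lines while still using gzip.GzipFile directly, one option is to wrap it in io.TextIOWrapper. This is only a sketch: the helper name iter_text_lines and the UTF-8 default are assumptions, not something from the answer above.

```python
import gzip
import io
import os
import tempfile


def iter_text_lines(path, encoding="utf-8"):
    """Yield str lines from a gzip file opened with gzip.GzipFile (hypothetical helper)."""
    # GzipFile yields bytes; TextIOWrapper decodes them to str.
    # Closing the wrapper also closes the underlying GzipFile.
    with io.TextIOWrapper(gzip.GzipFile(path, mode="rb"), encoding=encoding) as text:
        for line in text:
            yield line


if __name__ == "__main__":
    # Throwaway demo file in a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "demo.gz")
    with gzip.GzipFile(path, mode="wb") as f:
        f.write(b"one\ntwo\n")
    for line in iter_text_lines(path):
        print(repr(line))
```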

Solution 4:

Python's gzip module decompresses in a single thread, which can be slow on large files. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are that you have to install pigz and that it uses more cores during the run, but it is much faster and no more memory intensive. The call to open the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename,'rt'). The pigz flags are -d for decompress and -c for writing to stdout, which os.popen then captures. Because os.popen passes the command string to the shell, a filename containing spaces or shell metacharacters must be quoted.
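As a hedged alternative to os.popen, the same pipe can be built with subprocess.Popen and an argument list, which bypasses the shell so the filename needs no quoting. The helper names iter_command_lines and pigz_command are hypothetical, and the sketch assumes pigz is on the PATH and that the file is UTF-8 text.

```python
import subprocess
import sys


def iter_command_lines(argv, encoding="utf-8"):
    """Run argv without a shell and yield its stdout line by line as str."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, encoding=encoding) as proc:
        for line in proc.stdout:
            yield line
    # Leaving the with block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, argv)


def pigz_command(filename):
    """Build the argument list for 'pigz -dc filename' (assumes pigz is installed)."""
    return ["pigz", "-dc", filename]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Count the lines of the gzip file given on the command line.
    count = sum(1 for _ in iter_command_lines(pigz_command(sys.argv[1])))
    print(count)
```

Unlike os.popen, this version raises an error if the decompressor exits with a non-zero status, for example when the file is missing or corrupt.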

The following script takes a file and a number (1 or 2) and counts the lines in the file using the corresponding call, measuring how long the count takes. Save it as unzip-file.py:

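#!/usr/bin/python
import os
import sys
import time
import gzip

def local_unzip(obj):
    # Count the lines in an open file-like object and report the elapsed time.
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    print(time.time() - t0, count)

# Usage: unzip-file.py <file> <1|2>
r = sys.argv[1]
if sys.argv[2] == "1":
    # Variant 1: decompress inside Python with the gzip module.
    local_unzip(gzip.open(r,'rt'))
else:
    # Variant 2: let pigz decompress to stdout and read from the pipe.
    local_unzip(os.popen('pigz -dc ' + r))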

Running both variants on a 28G file under /usr/bin/time -f %M, which reports the maximum resident set size of the process in kilobytes, we get:

$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116

$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996

This shows that the system call is about five times faster (10 minutes compared to 50 minutes) while using essentially the same maximum memory. It is also worth noting that, depending on what you do with each line, reading the file might not be the limiting factor, in which case the option you choose makes little difference.