Why is gzip slow despite CPU and hard drive performance not being maxed out?

I have some JSON files, 20 GB each, that I want to compress with gzip:

gzip file1.json

This takes up one full CPU core, all fine.

It processes around 25 MB/s (checked in atop), my hard drive can read 125 MB/s and I have 3 free processor cores, so I expect to get a speed-up when compressing multiple files in parallel. So I run in other terminals:

gzip file2.json
gzip file3.json
gzip file4.json

Surprisingly, my throughput does not increase; CPU is around 25% on each core, and my HD still only reads at 25 MB/s.

Why and how to address it?

Solution 1:

I found it out:

The reason is that gzip operates on (in terms of CPU speed vs HD seek speed these days) extremely low buffer sizes.

It reads a few KB from from the input file, compresses it, and flushes it to the output file. Given the fact that this requires a hard drive seek, only a few operations can be done per seconds.

The reason my performance did not scale is because already one gzip was seeking like crazy.

I worked around this by using the unix buffer utility:

buffer -s 100000 -m 10000000 -p 100 < file1.json | gzip > file1.json.gz

By buffering a lot of input before sending it to gzip, the number of small seeks can be dramatically reduced. The options:

-s and -m are to specify the size of the buffer (I believe it is in KB, but not sure)
-p 100 makes sure that the data is only passed to gzip once the buffer is 100% filled

Running four of these in parallel, I could get 4 * 25 MB/s throughput, as expected.

I still wonder why gzip doesn't allow to increase the buffer size - this way, it is pretty useless if run on a spinning disk.

EDIT: I tried out a few more compression programs behaviour:

bzip2 only processes 2 MB/s due to its stronger / more CPU intensive compression
lzop seems to allow larger buffers: 70 MB/s per core, and 2 cores can max out my HD without over-seeking

Solution 2:

After looking at the first five or so lectures in the MIT OpenCourseware for 6.172: "Performance Engineering of Software Systems", I ran the Linux performance analyzer 'perf' on a moderately large test file. The result appears to show pipeline stalls where one instruction has to wait for the result of a preceding one.

       │         while (lookahead != 0) {                                                                
       │             /* Insert the string window[strstart .. strstart+2] in the                          
       │              * dictionary, and set hash_head to the head of the hash chain:                     
       │              */                                                                                 
       │             INSERT_STRING(strstart, hash_head);                                                 
  2.07 │       movzbl 0x8096d82(%edx),%eax                                                               
  3.99 │       mov    %edx,%ebp                                                                          
       │       shl    $0x5,%ecx                                                                          
  0.03 │       and    $0x7fff,%ebp                                                                       
  1.94 │       xor    %ecx,%eax                                                                          
  1.43 │       and    $0x7fff,%eax                                                                       
  2.01 │       mov    %eax,0x805e588                                                                     
  2.40 │       add    $0x8000,%eax                                                                      
  0.88 │       movzwl 0x8062140(%eax,%eax,1),%ecx                                                        
 23.79 │       movzwl %cx,%edi                                                                           
       │             /* Find the longest match, discarding those <= prev_length.

The second last instruction is copying to %ecx and the last one has to wait (stalling the pipeline) until the %cx register has data ready to use. This pipeline stall holds up the containing loop.

This is a result of some really obscure 'old-school' C programming style.

Could I access a blocked website via a virtual machine if the host OS has the web address blocked in its hosts file?

Are different brands of the 'same' graphics card better than other brands of that same graphics card? [closed]

Add 'Open Powershell here as admin' option to folder context menu

In Console.app, how can I reveal to what <private> tags are actually referring?

Which ffmpeg package I should download for macOS?

find command cannot find my files which do exist

Util for trimming output to terminal width?

How to return to last directory with cmd.exe? [duplicate]

Can dual monitors make my PC slower?

Is there a simple way to move/copy a logical volume from one volume group to another? (LVM2)

How to make Windows 7 remember credentials for a Samba network drive?