Why is GNU diff such a memory hog? [closed]

There are plenty of questions asking how to diff large files, because diff can't handle them.

I'd like to know why GNU diff can't handle them.

I did a tiny experiment: I diffed two identical datasets, something like this:

$ time /usr/bin/diff -u <(cat file1) <(cat file2) > /tmp/memoryhog
^C

real    5m6.478s
user    0m0.540s
sys     0m19.184s

This is what top showed at the time I cancelled the job:

diff memory usage

  PID  %MEM    VIRT   SWAP    RES   CODE    DATA    SHR nMaj nDRT S  PR  NI  %CPU COMMAND
19087  30.0   16.0g          9.4g   0.2m   16.0g   2.0m           R  20       0.5 /usr/bin/diff -u /dev/fd/63 /dev/fd/62

And the output was, as expected, empty:

$ stat -c '%s' /tmp/memoryhog 
0

(They weren't actually files but database results, and I forgot to track how many bytes diff actually consumed in that time; my guesstimate is 30-60 GiB per piped file.)

But what's going on there?

Why is diff allocating tons of memory when it doesn't even need to record a single changed byte?

I can only assume it partly comes from having to keep track of line numbers, but allocating 16 GiB of virtual memory seems a bit much for just that task!

What does diff think it needs that much memory for? Or is it just bad memory handling?

I already tried to keep diff as "stateless" or context-less as possible, only using -u, but I couldn't find any option to stop tracking line numbers or otherwise improve things further.

(The option --speed-large-files is actually a bogus argument; it's not implemented in the code, so please don't suggest that one.)

Edit: correcting my own faulty code check above, I found this buried in the bug-diffutils mailing list:

When you pass --speed-large-files, src/diff.c sets speed_large_files to true. Then, src/analyze.c sets ctxt.heuristic to true. Then, gnulib/lib/diffseq.h function diag() applies your heuristic.


Solution 1:

I believe I found the reason for this behavior.
It seems that diff always reads the whole of both files into memory.
I'm honestly surprised by this; I didn't think that was the case, or necessary for a line-based tool, but apparently it is.

This info is based on a bug report here: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=21665

Unfortunately I have found that diff reads the entire input files into
memory, leading to "/usr/bin/diff: memory exhausted" messages [...]

with a reply that reads:

> Would you be open to patches that enable diffing large files by using
> mmap?

I doubt whether that would help that much, as it still needs to construct information about each line, and that information consumes memory too. Doing this in secondary storage would be a bear. In practice when I've run into this problem, I've either gotten a bigger machine or made my input lines shorter. Preferably the former.

and finally

As Paul responded [...], using mmap seems
unlikely to help much, but if you write the patch and demonstrate that
it does make a difference, we'll be very interested, and I will
happily reopen the issue.

For now, I'm marking this as notabug and closing it.

That being the case, it seems GNU diff will stay limited when it comes to handling large files, unless someone either finds a way around the objections raised in the bug report or implements a diff tool that works differently.
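To get an intuition for where the memory goes: before the comparison even starts, diff has to keep every line of both inputs reachable, plus per-line bookkeeping (hashes, equivalence classes, line pointers). With inputs in the 30-60 GiB range, even a few dozen bytes of overhead per line adds up to several extra GiB on top of the line data itself. Below is a rough Python sketch of that kind of bookkeeping; it is not GNU diff's actual code, just an illustration of why memory grows with input size rather than with the size of the differences:

# Rough sketch (Python, not GNU diff's actual code) of the per-line
# bookkeeping a line-based diff builds before comparing anything.
import sys

def read_side(path, classes):
    """Read one whole file; return one integer per line (its equivalence class)."""
    ids = []
    with open(path, 'rb') as f:
        for line in f:
            # Every distinct line stays alive as a dict key, so the whole
            # input effectively remains in memory.
            ids.append(classes.setdefault(line, len(classes)))
    return ids

classes = {}                      # line contents -> small integer id
left = read_side(sys.argv[1], classes)
right = read_side(sys.argv[2], classes)

# Only now could an LCS/Myers-style comparison run over `left` and `right`.
# Even for identical inputs, both id arrays and the class table exist in
# full, so memory grows with input size, not with the size of the diff.
print(len(left), len(right), len(classes))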

If someone comes up with a better or more in-depth answer, perhaps from code review, I'll happily accept it.


P.S. I've only had moderate success so far with my own line-by-line reader in Python based on difflib, which was intended to find differences but not to create patchable diff files. It reads several GiB fine, but at some point it seems to "desync" and from then on reports diffs for lines that are actually identical. And it is slow, of course. If I can build a workable solution at some point, I'll post the source.
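Roughly, the idea was something like this (a much simplified sketch, not the actual script; the chunk size and names are just for illustration):

# Compare two huge line-based files CHUNK lines at a time instead of
# loading them whole. Simplified sketch only.
import difflib, sys
from itertools import islice, zip_longest

CHUNK = 100_000   # lines per block, chosen arbitrarily

def chunks(f, n):
    """Yield successive lists of up to n lines from an open file."""
    while True:
        block = list(islice(f, n))
        if not block:
            return
        yield block

def stream_diff(path_a, path_b):
    with open(path_a) as a, open(path_b) as b:
        pairs = zip_longest(chunks(a, CHUNK), chunks(b, CHUNK), fillvalue=[])
        for i, (left, right) in enumerate(pairs):
            if left != right:
                # Only this pair of chunks is ever in memory.
                sys.stdout.writelines(
                    difflib.unified_diff(left, right,
                                         'file1[chunk %d]' % i,
                                         'file2[chunk %d]' % i))

if __name__ == '__main__':
    stream_diff(sys.argv[1], sys.argv[2])

The "desync" is presumably inherent to this approach: as soon as one file gains or loses a line, every later chunk pair is shifted against the other, so identical lines get compared against the wrong counterparts, and the per-chunk hunks can't be stitched into a patchable file-level diff.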