Is there a diff-like algorithm that handles moving block of lines?
The diff
program, in its various incarnations, is reasonably good at computing the difference between two text files and expressing it more compactly than showing both files in their entirety. It shows the difference as a sequence of inserted and deleted chunks of lines (or changed lines in some cases, but that's equivalent to a deletion followed by an insertion). The same or very similar program or algorithm is used by patch
and by source control systems to minimize the storage required to represent the differences between two versions of the same file. The algorithm is discussed here and here.
But it falls down when blocks of text are moved within the file.
Suppose you have the following two files, a.txt
and b.txt
(imagine that they're both hundreds of lines long rather than just 6):
a.txt b.txt
----- -----
1 4
2 5
3 6
4 1
5 2
6 3
diff a.txt b.txt
shows this:
$ diff a.txt b.txt
1,3d0
< 1
< 2
< 3
6a4,6
> 1
> 2
> 3
The change from a.txt
to b.txt
can be expressed as "Take the first three lines and move them to the end", but diff
shows the complete contents of the moved chunk of lines twice, missing an opportunity to describe this large change very briefly.
Note that diff -e
shows the block of text only once, but that's because it doesn't show the contents of deleted lines.
Is there a variant of the diff
algorithm that (a) retains diff
's ability to represent insertions and deletions, and (b) efficiently represents moved blocks of text without having to show their entire contents?
Since you asked for an algorithm and not an application, take a look at "The String-to-String Correction Problem with Block Moves" by Walter Tichy. There are others, but that's the original, so you can look for papers that cite it to find more.
The paper cites Paul Heckel's paper "A technique for isolating differences between files" (mentioned in this answer to this question) and mentions this about its algorithm:
Heckel[3] pointed out similar problems with LCS techniques and proposed a linear-lime algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring.
The following method is able to detect block moves:
Paul Heckel: A technique for isolating differences between files
Communications of the ACM 21(4):264 (1978)
http://doi.acm.org/10.1145/359460.359467 (access restricted)
Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)
wikEd diff is a free JavaScript diff library that implements this algorithm and improves on it. It also includes the code to compile a text output with insertions, deletions, moved blocks, and original block positions inserted into the new text version. Please see the project page or the extensively commented code for details. For testing, you can also use the online demo.
Git 2.16 (Q1 2018) will introduce another possibility, by ignoring some specified moved lines.
"git diff
" learned a variant of the "--patience
" algorithm, to which the user can specify which 'unique' line to be used as anchoring points.
See commit 2477ab2 (27 Nov 2017) by Jonathan Tan (jhowtan
).
(Merged by Junio C Hamano -- gitster
-- in commit d7c6c23, 19 Dec 2017)
diff
: support anchoring line(s)
Teach
diff
a new algorithm, one that attempts to prevent user-specified lines from appearing as a deletion or addition in the end result.
The end user can use this by specifying "--anchored=<text>
" one or more times when using Git commands like "diff
" and "show
".
The documentation for git diff
now reads:
--anchored=<text>:
Generate a diff using the "anchored diff" algorithm.
This option may be specified more than once.
If a line exists in both the source and destination, exists only once, and starts with this text, this algorithm attempts to prevent it from appearing as a deletion or addition in the output.
It uses the "patience diff" algorithm internally.See the tests for some examples:
pre post a c b a c b
normally,
c
is moved to produce the smallest diff.
But:
git diff --no-index --anchored=c pre post
Diff would be a
.
With Git 2.33 (Q3 2021), the command line completion (in contrib/
) learned that "git diff
"(man) takes the --anchored
option.
See commit d1e7c2c (30 May 2021) by Thomas Braun (t-b
).
(Merged by Junio C Hamano -- gitster
-- in commit 3a7d26b, 08 Jul 2021)
completion
: add --anchored to diff's optionsSigned-off-by: Thomas Braun
This flag was introduced in 2477ab2 ("
diff
: support anchoring line(s)", 2017-11-27, Git v2.16.0-rc0 -- merge listed in batch #10) but back then, the bash completion script did not learn about the new flag.
Add it.