Myers diff algorithm vs Hunt–McIlroy algorithm
The longest common subsequence problem is a classic computer science problem; algorithms that solve it are at the root of version control systems and wiki engines. Two basic algorithms are the Hunt–McIlroy algorithm, which was used to create the original version of diff, and the Myers diff algorithm, which the GNU diff utility currently uses. Both seem to work by finding the shortest path through a graph that represents the edit space between the two strings or text files, where the length of that path is the number of inserts and deletes necessary to transform one sequence into the other. So what exactly is the difference between the Myers diff algorithm and the Hunt–McIlroy algorithm?
Solution 1:
- The Myers algorithm is a "divide and conquer" algorithm: it works by recursively finding the central match of the two sequences along a smallest edit script. Once that is found, only the match is memorized, and the two subsequences preceding and following it are compared recursively until there is nothing left to compare. Finding the central match is done by extending matches from both ends of the sequences as far as possible; whenever no further match is possible, the edit cost is increased by 1, each furthest position attained so far on each diagonal is scanned, and the matches are extended again from there. When a match extended from one end meets one extended from the other end, the algorithm has found the central match. This approach has the advantage of occupying O(m+n) memory, and it runs in O((m+n)d) time, d being the length of the shortest edit script.
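To make the "furthest position per diagonal" idea concrete, here is a minimal sketch in Python of the greedy core that the divide-and-conquer variant builds on; it computes only the edit-script length d, not the script itself, and the function name is my own:

```python
def myers_distance(a, b):
    """Length of the shortest edit script (inserts + deletes) turning a into b.

    A sketch of the greedy core of Myers' algorithm: for each edit cost d it
    keeps, per diagonal k = x - y, the furthest-reaching x, and follows
    matching runs ("snakes") for free before spending another edit.
    """
    n, m = len(a), len(b)
    v = {1: 0}  # v[k] = furthest x reached on diagonal k so far
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Step down (insert) or right (delete), whichever neighbour
            # diagonal reached further with cost d - 1.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # down from diagonal k + 1
            else:
                x = v[k - 1] + 1    # right from diagonal k - 1
            y = x - k
            # Follow the snake: diagonal moves along matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # unreachable; the loop always terminates by d = n + m
```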
- The Hunt and McIlroy approach computes all matches in the whole file up front and indexes them into so-called k-candidates, k being the length of the longest common subsequence of the prefixes ending at that match. The LCS is grown progressively by finding matches that fall within the proper ordinates (following a rule explained in their paper), and each path is memorized while doing so. The problem with this approach is that it does more work than necessary: it memorizes all the paths, which costs O(mn) memory in the worst case, and O(mn log m) in time complexity.
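For comparison, a sketch of the match-indexing idea, here in the Hunt–Szymanski formulation that refines Hunt and McIlroy's candidate method; to stay short it returns only the LCS length and does not memorize the paths:

```python
from bisect import bisect_left
from collections import defaultdict

def hunt_lcs_length(a, b):
    """LCS length via precomputed match lists and k-candidates."""
    # Index every position in b for each value: the precomputed matches.
    positions = defaultdict(list)
    for j, item in enumerate(b):
        positions[item].append(j)

    # thresh[k] = smallest b-position ending a common subsequence of
    # length k + 1 found so far; it stays sorted, so binary search applies.
    thresh = []
    for item in a:
        # Visit this element's matches right-to-left so two matches from
        # the same element of a can never chain with each other.
        for j in reversed(positions.get(item, ())):
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)   # a new (k+1)-candidate extends the LCS
            else:
                thresh[k] = j      # a better candidate for the same length
    return len(thresh)
```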
The Myers algorithm mostly wins because it does not memorize the paths while working and does not need to "foresee" where to go; it can therefore concentrate at any time on just the furthest positions reachable with an edit script of smallest cost. That "smallest cost" ensures that what is found is the LCS, and not memorizing which path was taken until a match is found uses less memory. Not trying to compute all matches ahead of time avoids storing them, and makes them a benefit at match time rather than a memory hog.
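As a quick cross-check of the two sketches above (my own example values): for sequences of lengths m and n, the shortest edit script and the LCS are tied together by d = m + n - 2·|LCS|, so the two functions should always agree:

```python
a, b = "ABCABBA", "CBABAC"   # the example pair from Myers' paper
d = myers_distance(a, b)
lcs = hunt_lcs_length(a, b)
assert d == len(a) + len(b) - 2 * lcs
print(d, lcs)  # 5 4
```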