Calculate minimal operations to make two tree structures identical

This is more of a CS question, but an interesting one :

Let's say we have 2 tree structures with more or less the same nodes reorganized. How would you find

  1. any
  2. in some sense minimal

sequence of operations

  • MOVE(A, B) - moves node A under node B (with the whole subtree)
  • INSERT(N, B) - inserts a new node N under node B
  • DELETE (A) - deletes the node A (with the whole subtree)

that transforms one tree to the other.

There might obviously be cases where such transformation is not possible, trivial being root A with child B to root B with child A etc.). In such cases, the algorithm would simply deliver an result "not possible".

Even more spectacular version is a generalization for networks, i.e. when we assume that a node can occur multiple times in the tree (effectively having multiple "parents"), while cycles are forbidden.

Disclaimer : This is not a homework, actually it comes from a real business problem and I found it quite interesting wondering if somebody might know a solution.


There is not only a Wikipedia article on graph isomorphism (as Space_C0wb0y points out) but also a dedicated article on the graph isomorphism problem. It has a section Solved special cases for which polynomial-time solutions are known. Trees is one of them and it cites the following two references:

  • P.J. Kelly, "A congruence theorem for trees" Pacific J. Math., 7 (1957) pp. 961–968
  • Aho, Alfred V.; Hopcroft, John; Ullman, Jeffrey D. (1974), The Design and Analysis of Computer Algorithms, Reading, MA: Addison–Wesley .

You weren't clear if you were comparing abstract syntax trees for source code, XML documents interpreted as trees, or some other type of tree.

There's a number of papers that discuss comparing syntax trees and computing minimum distances by various means. The ideas should be relevant.

A good paper is Change Distilling, which tries to compare the source code for two abstract syntax trees and report a minimal difference. The paper talks about a specific method, and also briefly mentions (and provides references) to a variety of similar techniques.

Few of these algorithms are actually realized in available tools for comparing computer program source text. Our Smart Differencer is one of them; it is driven by an explicit language grammar for many languages.


Although this question is old, i'll add a couple more references and algorithms below:

  1. X-Diff: An Effective Change Detection Algorithm for XML Documents , Yuan Wang, David J. DeWitt, Jin-Yi Cai
  2. KF-Diff+: Highly Efficient Change Detection Algorithm for XML Documents
  3. diffX: An Algorithm to Detect Changes in Multi-Version XML Documents
  4. Change Detection in XML Trees: a Survey, Luuk Peters
  5. Similarity in Tree Data Structures

Furthermore, there are libraries and frameworks on GitHub (in javascript) which implement diffing of Tree-like structures for example applications dealing with JSON data or XML Trees (e.g for client-side MVC/MVVM):

  1. React.js
  2. JSON-Patch
  3. jsondiffpatch
  4. objectDiff

In case folks find this question and need something implemented for Node.js or the browser, I'm providing a link and code example for an implementation I've written that you can find on github here: (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram).

This is a tree edit distance approximation algorithm, but it is much, much faster than trying to find the true edit distance. The approximation performs in O(n log n) time and O(n) space whereas true edit distance is often O(n^3) or O(n^2) using known algorithms for true edit distance. See the academic paper from which the PQ-Gram algorithm comes: (http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf)

So using jqgram:

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3 },
function(result) {
    console.log(result.distance);
});

And that gives you a number between 0 and 1. The closer to zero, the more closely related the two trees look to jqgram. One approach could be to use jqgram to narrow in on several closely related trees from among many trees given its speed, then utilize true edit distance on the few trees remaining that you need to take a closer inspection of, and for that you can find python implementations for reference or port of the Zhang & Shasha algorithm for example.

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, its Javascript, so use it client or server-side!

ALSO, to answer with respect to cycle detection, check out the clone method inside of jqgram, there is cycle detection there, but credit for that goes to the author of node-clone from which that piece was slightly modified and included.