Anyone have a diff algorithm for rendered HTML? [closed]

Solution 1:

Here's a nice trick that noticeably improves the look of a rendered HTML diff. It doesn't fully solve the original problem, but it makes side-by-side output much easier to read.

When two versions are rendered side by side, corresponding content rarely lines up vertically, and vertical alignment is crucial for comparing the two sides. To improve it, you can insert invisible HTML elements into each version at "checkpoints" where the two sides should be vertically aligned. Then you can use a bit of client-side JavaScript to add vertical spacing around each checkpoint until the sides line up.

Explained in a little more detail:

To use this technique, run your diff algorithm and insert visibility:hidden <span>s or tiny <div>s wherever the diff says your side-by-side versions should match up. Then run JavaScript that finds each checkpoint and its side-by-side neighbor, and adds vertical spacing to whichever of the two is higher up (shallower) on the page. Your rendered HTML diff is now vertically aligned up to that checkpoint, and you can continue repairing the alignment, checkpoint by checkpoint, down the rest of the page.
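
The spacing arithmetic itself doesn't depend on the DOM, so here is a rough sketch of it in Python rather than JavaScript. It assumes you have already measured the vertical offset of every checkpoint on both sides (in a browser those numbers would come from layout measurements). The function name `checkpoint_spacing` and its inputs are made up for illustration.

```python
from typing import List, Tuple


def checkpoint_spacing(
    left_tops: List[float], right_tops: List[float]
) -> Tuple[List[float], List[float]]:
    """Return the extra spacing to add above each checkpoint on each side.

    left_tops[i] and right_tops[i] are the measured vertical offsets of the
    i-th checkpoint pair before any spacing is added (hypothetical inputs).
    """
    if len(left_tops) != len(right_tops):
        raise ValueError("checkpoints must come in left/right pairs")

    left_pad: List[float] = []
    right_pad: List[float] = []
    # Spacing already added higher up pushes every later checkpoint down.
    left_shift = 0.0
    right_shift = 0.0

    for left, right in zip(left_tops, right_tops):
        # Where each checkpoint sits now, after the earlier adjustments.
        gap = (right + right_shift) - (left + left_shift)
        # Only the higher (shallower) checkpoint of the pair gets spacing.
        left_add = max(0.0, gap)
        right_add = max(0.0, -gap)
        left_pad.append(left_add)
        right_pad.append(right_add)
        left_shift += left_add
        right_shift += right_add

    return left_pad, right_pad
```

Working from the top of the page down matters here: spacing added at one checkpoint shifts everything below it, which is why the running `left_shift` and `right_shift` totals are carried along.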

Solution 2:

Over the weekend I posted a new project on CodePlex that implements an HTML diff algorithm in C#. The original algorithm was written in Ruby. I understand you were looking for a JavaScript implementation, but having one available in C# with source code might help you port the algorithm. Here is the link if you are interested: htmldiff.codeplex.com. You can read more about it here.

UPDATE: This library has been moved to GitHub.

Solution 3:

I ended up needing something similar a while back. To get the HTML to line up side by side, you could use two iframes, but you'd then have to tie their scrolling together via JavaScript (if you allow scrolling).

To see the diff, however, you will more than likely want to use someone else's library. I used DaisyDiff, a Java library, for a similar project where my client was happy with seeing a single HTML rendering of the content with MS Word "track changes"-like markup.

HTH

Solution 4:

Consider using links or lynx to render a text-only version of each HTML page (for example with lynx -dump), and then diff the two text outputs.
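
If you'd rather not shell out to a text browser, the same idea can be roughed out in Python. Note this is a stand-in, not what lynx does: it uses the standard library's `html.parser` to pull out the visible text and `difflib` to compare it, so you lose lynx's layout (inline markup such as `<b>` ends up splitting a sentence across lines). The class and function names are just for illustration.

```python
import difflib
from html.parser import HTMLParser
from typing import List


class TextOnly(HTMLParser):
    """Collect visible text, one line per text run, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace, as a text renderer would.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.lines.append(text)


def html_to_text_lines(html: str) -> List[str]:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.lines


def text_diff(old_html: str, new_html: str) -> List[str]:
    """Unified diff of the text-only renderings of two HTML documents."""
    return list(
        difflib.unified_diff(
            html_to_text_lines(old_html),
            html_to_text_lines(new_html),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Calling `text_diff(old_html, new_html)` gives you ordinary unified-diff lines, so markup-only changes (a renamed class, a reordered attribute) don't show up at all, which is the whole point of diffing the text rendering.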

Solution 5:

What about DaisyDiff (Java and PHP versions available)?

The following features are really nice:

  • Works with badly formed HTML that can be found "in the wild".
  • The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
  • In addition to the default visual diff, HTML source can be diffed coherently.
  • Provides easy to understand descriptions of the changes.
  • The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
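
To give a feel for the second point (a change inside a text node shouldn't flag the whole node), here is a tiny word-level sketch in Python using the standard library's `difflib`. This is not DaisyDiff's algorithm, just an illustration of the idea; the function name and the choice of `<ins>`/`<del>` markup are my own assumptions.

```python
import difflib
from html import escape


def mark_word_changes(old_text: str, new_text: str) -> str:
    """Return new_text with word-level <del>/<ins> markup against old_text."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = escape(" ".join(old_words[i1:i2]))
        added = escape(" ".join(new_words[j1:j2]))
        if op == "equal":
            # Unchanged words pass through untouched.
            parts.append(removed)
        else:
            # "replace", "delete" and "insert" differ only in which side is empty.
            if removed:
                parts.append(f"<del>{removed}</del>")
            if added:
                parts.append(f"<ins>{added}</ins>")
    return " ".join(parts)
```

Only the words that actually changed get wrapped, so the rest of the text node renders as before.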