Quantifying the amount of change in a git diff?

I use git for a slightly unusual purpose--it stores my text as I write fiction. (I know, I know...geeky.)

I am trying to keep track of productivity, and want to measure the degree of difference between subsequent commits. The writer's proxy for "work" is "words written", at least during the creation stage. I can't use straight word count as it ignores editing and compression, both vital parts of writing. I think I want to track:

 (words added)+(words removed)

which will double-count (words changed), but I'm okay with that.

It'd be great to type some magic incantation and have git report this distance metric for any two revisions. However, git diffs are patches, which show entire lines even if you've only twiddled one character on the line; I don't want that, especially since my 'lines' are paragraphs. Ideally I'd even be able to specify what I mean by "word" (though \W+ would probably be acceptable).

Is there a flag to git-diff to give diffs on a word-by-word basis? Alternately, is there a solution using standard command-line tools to compute the metric above?


Solution 1:

wdiff does word-by-word comparison. Git can be configured to use an external program to do the diffing. Based on those two facts and this blog post, the following should do roughly what you want.

Create a script that ignores the arguments git-diff supplies that wdiff doesn't need and passes only the two file paths to wdiff. Save the following as ~/wdiff.py or something similar and make it executable.

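#!/usr/bin/env python3
# git calls an external diff program with seven arguments:
#   path old-file old-hex old-mode new-file new-hex new-mode
import subprocess
import sys

# -s prints word statistics; -3 omits the words common to both files.
# Passing a list avoids shell-quoting problems with odd file names.
subprocess.call(["wdiff", "-s3", sys.argv[2], sys.argv[5]])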

Tell git to use it.

git config --global diff.external ~/wdiff.py
git diff filename

Solution 2:

Building on James' and cornmacrelf's input, I've added arithmetic expansion and come up with a few reusable aliases for counting the words added, deleted, and duplicated in a git diff:

alias gitwa='git diff --word-diff=porcelain origin/master | grep -e "^+[^+]" | wc -w | xargs'
alias gitwd='git diff --word-diff=porcelain origin/master | grep -e "^-[^-]" | wc -w | xargs'
alias gitwdd='git diff --word-diff=porcelain origin/master | grep -e "^+[^+]" -e "^-[^-]" | sed -e "s/.//" | sort | uniq -d | wc -w | xargs'

alias gitw='echo $(($(gitwa) - $(gitwd)))'

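The output from gitwa and gitwd is trimmed using the xargs trick: piping through xargs strips the whitespace padding that wc may emit, so the numbers can be used in the arithmetic expansion in gitw, which reports the net change (words added minus words deleted).

The duplicated-words count (gitwdd) comes from Miles' answer.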

Solution 3:

git diff --word-diff works in the latest stable version of git (at git-scm.com).

There are a few options that let you decide which format you want the output in. The default is quite readable, but you might want --word-diff=porcelain if you're feeding the output into a script.
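To illustrate the "feeding the output into a script" part, here is a minimal Python sketch that parses porcelain output and counts words added and removed. The function names (count_word_changes, git_word_distance) and the \w+ definition of a word are my own assumptions, not anything git provides. It relies on the porcelain format prefixing added runs with +, removed runs with -, and unchanged text with a space, and it skips the per-file header lines (including the --- and +++ lines) by only counting inside @@ hunks.

```python
import re
import subprocess

WORD = re.compile(r"\w+")  # assumption: a "word" is a run of word characters


def count_word_changes(porcelain_text, word_re=WORD):
    """Return (added, removed) word counts from --word-diff=porcelain output."""
    added = removed = 0
    in_hunk = False
    for line in porcelain_text.split("\n"):
        if line.startswith("diff --git"):
            in_hunk = False  # new file header: ignore lines until the next hunk
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += len(word_re.findall(line[1:]))
        elif in_hunk and line.startswith("-"):
            removed += len(word_re.findall(line[1:]))
    return added, removed


def git_word_distance(old_rev, new_rev):
    """(words added) + (words removed) between two revisions of the current repo."""
    result = subprocess.run(
        ["git", "diff", "--word-diff=porcelain", old_rev, new_rev],
        capture_output=True, text=True, check=True,
    )
    added, removed = count_word_changes(result.stdout)
    return added + removed
```

Calling git_word_distance("HEAD~1", "HEAD") inside a repository would then give the added-plus-removed metric from the question for the last commit. Treat the number as an approximation: it inherits whatever word boundaries git used when producing the diff.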

Solution 4:

I figured out a way to get concrete numbers by building on top of the other answers here. The result is an approximation, but it should be close enough to serve as a useful indicator of the number of characters that were added or removed. Here's an example with my current branch compared to origin/master:

$ git diff --word-diff=porcelain origin/master | grep -e '^+[^+]' | wc -m
38741
$ git diff --word-diff=porcelain origin/master | grep -e '^-[^-]' | wc -m
46664

Subtracting the added characters (38741) from the removed characters (46664) shows that my current branch has removed approximately 7923 characters net. The individual added/removed counts are inflated by the diff's +/- markers and indentation characters; however, taking the difference should cancel out a significant portion of that inflation in most cases.

Solution 5:

Git has had (for a long time) a --color-words option for git diff. This doesn't get you your counting, but it does let you see the diffs.

scompt.com's suggestion of wdiff is also good; it's pretty easy to shove in a different differ (see git-difftool). From there you just have to go from the output wdiff can give to the result you really want.

There's one more exciting thing to share, though, from git's what's cooking:

* tr/word-diff (2010-04-14) 1 commit
  (merged to 'next' on 2010-05-04 at d191b25)
 + diff: add --word-diff option that generalizes --color-words

Here's the commit introducing word-diff. Presumably it will make its way from next into master before long, and then git will be able to do this all internally - either producing its own word diff format or something similar to wdiff. If you're daring, you could build git from next, or just merge that one commit into your local master to build.

Thanks to Jakub's comment: you can further customize word diffs if necessary by providing a word regex (config parameter diff.*.wordRegex), documented in gitattributes.
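If you'd rather experiment with the definition of "word" outside git, the same idea can be sketched in plain Python with the standard difflib module. This is not git's diff algorithm, and word_distance is a hypothetical helper name. It only shows how a word regex changes the added-plus-removed count: everything the regex matches is a word, and everything else is treated as a separator.

```python
import difflib
import re


def word_distance(old_text, new_text, word_regex=r"\w+"):
    """Return (words added) + (words removed) between two strings."""
    old_words = re.findall(word_regex, old_text)
    new_words = re.findall(word_regex, new_text)
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # words present only in the old text
        if tag in ("replace", "insert"):
            added += j2 - j1    # words present only in the new text
    return added + removed


# A changed word counts twice (once removed, once added), as in the question.
print(word_distance("The quick brown fox", "The quick red fox"))
```

Passing word_regex=r"\S+" instead would treat a hyphenated term such as "co-op" as a single word rather than two.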