What's git's heuristic for assigning content modifications to file paths?

Solution 1:

As you noticed, Git performs rename detection using a heuristic, rather than being told that a rename occurred. The git mv command, in fact, simply stages an add on the new file path and a remove of the old file path. Thus, rename detection is performed by comparing the contents of added files to the previously committed contents of deleted files.

First, candidates are collected. Any new files are possible rename targets and any deleted files are possible rename sources. In addition, rewriting changes are broken: a file that differs from its previous revision by more than 50% is treated as both a possible rename source and a possible rename target.
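A minimal sketch of that candidate-collection step, for illustration only. The function name collect_candidates, the break_threshold parameter, and the use of Python's difflib as the "how different is it" measure are all assumptions of this sketch; Git uses its own estimate, not difflib.

```python
from difflib import SequenceMatcher

def collect_candidates(old, new, break_threshold=0.5):
    """old and new map path -> file content (str). Returns (sources, targets)."""
    sources = {p for p in old if p not in new}   # deleted paths
    targets = {p for p in new if p not in old}   # added paths
    for path in old.keys() & new.keys():
        if old[path] == new[path]:
            continue  # unchanged file, not a candidate
        # Stand-in dissimilarity measure (an assumption, not Git's formula).
        dissimilarity = 1.0 - SequenceMatcher(None, old[path], new[path]).ratio()
        if dissimilarity > break_threshold:
            # A heavily rewritten file counts as both source and target.
            sources.add(path)
            targets.add(path)
    return sources, targets
```

A path that exists on both sides only becomes a candidate when it has been rewritten past the threshold; everything else is a plain modification and is left out of rename matching.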

Next, identical renames are detected. If you rename a file without making any changes, the file hashes identically. Such renames can be detected just by comparing the hashes in the index, without reading the file contents, and removing them from the candidate list reduces the number of similarity comparisons that remain.
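To make the hash comparison concrete, here is a rough sketch. The blob ID computation (SHA-1 over a "blob <size>" header, a NUL byte, then the content) is how Git names file contents in SHA-1 repositories; the find_exact_renames helper and its tie-breaking by sorted path are hypothetical and exist only for this example.

```python
import hashlib

def blob_id(content: bytes) -> str:
    # Git names a blob by SHA-1 over "blob <size>\0" followed by the content.
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

def find_exact_renames(deleted, added):
    """deleted and added map path -> blob id. Returns [(old_path, new_path)]."""
    by_id = {}
    for path in sorted(deleted):
        by_id.setdefault(deleted[path], []).append(path)
    renames = []
    for new_path in sorted(added):
        olds = by_id.get(added[new_path])
        if olds:
            # Pair each added path with one deleted path of identical content.
            renames.append((olds.pop(0), new_path))
    return renames
```

Only the IDs are compared, never the contents, which is why this pass is cheap compared with the similarity pass that follows.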

Finally, the similarity comparison is performed. Each line in each candidate file is hashed and collected in a sorted list. Long lines are split at 60 characters. Whitespace-only lines may be stripped on the assumption that they don't contribute greatly to the similarity matching. The line hashes from each candidate source are compared to the line hashes from each candidate target. If two lists are at least 50% similar (Git's default threshold, adjustable with -M), they are deemed a rename.
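The comparison can be sketched roughly as follows. This is a simplified illustration, not Git's actual code: the names chunk_hashes, similarity and is_rename are made up, CRC32 stands in for whatever hash the real implementation uses, the scoring formula (shared bytes divided by the larger file's size) is an assumption, and whitespace-only lines are not stripped. The max_len default follows the line-splitting figure quoted above, and the threshold is passed in explicitly.

```python
from collections import Counter
import zlib

def chunk_hashes(data: bytes, max_len: int = 60) -> Counter:
    """Map chunk hash -> number of bytes covered by chunks with that hash."""
    counts = Counter()
    for line in data.splitlines(keepends=True):
        # Cut long lines into pieces of at most max_len bytes.
        for i in range(0, len(line), max_len):
            piece = line[i:i + max_len]
            counts[zlib.crc32(piece)] += len(piece)
    return counts

def similarity(src: bytes, dst: bytes) -> float:
    a, b = chunk_hashes(src), chunk_hashes(dst)
    common = sum((a & b).values())   # bytes in chunks present on both sides
    larger = max(len(src), len(dst))
    return common / larger if larger else 1.0

def is_rename(src: bytes, dst: bytes, threshold: float) -> bool:
    return similarity(src, dst) >= threshold
```

Because only hashes of lines are compared, reordering lines within a file does not lower the score in this sketch; only lines that were added, removed or changed do.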

Solution 2:

... short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?

Depending on what you mean by "full", I don't think you can find such a thing. (In particular, how are "percentages" calculated? Is it by lines, or characters/bytes, or something else? Does doing a word-oriented diff change things?) But the magic is all inside git diff, where it is computed dynamically every time a diff is to be shown, and the heuristics have several control knobs, described in the git-diff documentation, that give strong clues:

--no-renames

Turn off rename detection, even when the configuration file gives the default to do so.

-B[<n>][/<m>], --break-rewrites[=[<n>][/<m>]]

Break complete rewrite changes into pairs of delete and create. This serves two purposes:

  • It affects the way a change that amounts to a total rewrite of a file not as a series of deletion and insertion mixed together with a very few lines that happen to match textually as the context, but as a single deletion of everything old followed by a single insertion of everything new, and the number m controls this aspect of the -B option (defaults to 60%). -B/70% specifies that less than 30% of the original should remain in the result for Git to consider it a total rewrite (i.e. otherwise the resulting patch will be a series of deletion and insertion mixed together with context lines).

  • When used with -M, a totally-rewritten file is also considered as the source of a rename (usually -M only considers a file that disappeared as the source of a rename), and the number n controls this aspect of the -B option (defaults to 50%). -B20% specifies that a change with addition and deletion compared to 20% or more of the file's size are eligible for being picked up as a possible source of a rename to another file.

and so on; see the documentation for git-diff.