How does git detect similar files, for its rename detection?

Wikipedia explains the automatic rename detection:

Briefly, given a file in revision N, a file of the same name in revision N−1 is its default ancestor. However, when there is no like-named file in revision N−1, Git searches for a file that existed only in revision N−1 and is very similar to the new file.

Rename detection apparently boils down to similar file detection. Is that algorithm documented anywhere? It would be nice to know what kinds of transformations are detected automatically.


Solution 1:

Git tracks file contents, not filenames. So renaming a file without changing its content is easy for git to detect. (Git does not track, but performs detection; using git mv or git rm and git add is effectively the same.)

When a file is added to the repository, the filename is in the tree object. The actual file contents are added as a binary large object (blob) in the repository. Git will not add another blob for additional files that contain the same content. In fact, Git cannot as the content is stored in the filesystem with first two characters of the hash being the directory name and the rest being the name of file within it. So to detect renames is a matter of comparing hashes.

To detect small changes to a renamed file, Git uses certain algorithms and a threshold limit to see if this is a rename. For example, have a look at the -M flag for git diff. There are also configuration values such as merge.renameLimit (the number of files to consider when performing rename detection during a merge).

To understand how git treats similar files (i.e., what file transformations are considered as renames), explore the configuration options and flags available, as mentioned above. You need not be considered with the how. To understand how git actually accomplishes these tasks, look at the algorithms for finding differences in text, and read the git source code.

Algorithms are applied only for diff, merge, and log purposes -- they do not affect how git stores them. Any small change in file content means a new object is added for it. There is no delta or diff happening at that level. Of course, later, the objects might be packed where deltas are stored in packfiles, but that is not related to the rename detection.

Solution 2:

Is that algorithm documented anywhere?

It is at least illustrated with Git 2.33 (Q3 2021), where the documentation on "git diff -l<n>"(man) and diff.renameLimit have been updated, and the defaults for these limits have been raised.

See commit 94b82d5, commit 9dd29db, commit 6623a52, commit 05d2c61 (15 Jul 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 268055b, 28 Jul 2021)

rename: bump limit defaults yet again

Signed-off-by: Elijah Newren

These were last bumped in commit 92c57e5 ("bump rename limit defaults (again)", 2011-02-19, Git v1.7.5-rc0 -- merge), and were bumped both because processors had gotten faster, and because people were getting ugly merges that caused problems and reporting it to the mailing list (suggesting that folks were willing to spend more time waiting).

Since that time:

  • Linus has continued recommending kernel folks to set diff.renameLimit=0 (maps to 32767, currently)
  • Folks with repositories with lots of renames were happy to set merge.renameLimit above 32767, once the code supported that, to get correct cherry-picks
  • Processors have gotten faster
  • It has been discovered that the timing methodology used last time probably used too large example files.

The last point is probably worth explaining a bit more:

  • The "average" file size used appears to have been average blob size in the linux kernel history at the time (probably v2.6.25 or something close to it).
  • Since bigger files are modified more frequently, such a computation weights towards larger files.
  • Larger files may be more likely to be modified over time, but are not more likely to be renamed -- the mean and median blob size within a tree are a bit higher than the mean and median of blob sizes in the history leading up to that version for the linux kernel.
  • The mean blob size in v2.6.25 was half the average blob size in history leading to that point
  • The median blob size in v2.6.25 was about 40% of the mean blob size in v2.6.25.
  • Since the mean blob size is more than double the median blob size, any file as big as the mean will not be compared to any files of median size or less (because they'd be more than 50% dissimilar).
  • Since it is the number of files compared that provides the O(n^2) behavior, median-sized files should matter more than mean-sized ones.

The combined effect of the above is that the file size used in past calculations was likely about 5x too large.
Combine that with a CPU performance improvement of ~30%, and we can increase the limits by a factor of sqrt(5/(1-.3)) = 2.67, while keeping the original stated time limits.

Keeping the same approximate time limit probably makes sense for diff.renameLimit (there is no progress feedback in e.g. git log -p(man)), but the experience above suggests merge.renameLimit could be extended significantly.
In fact, it probably would make sense to have an unlimited default setting for merge.renameLimit, but that would likely need to be coupled with changes to how progress is displayed.
(See https://lore.kernel.org/git/YOx+Ok%[email protected]/ for details in that area.)
For now, let's just bump the approximate time limit from 10s to 1m.

(Note: We do not want to use actual time limits, because getting results that depend on how loaded your system is that day feels bad, and because we don't discover that we won't get all the renames until after we've put in a lot of work rather than just upfront telling the user there are too many files involved.)

Using the original time limit of 2s for diff.renameLimit, and bumping merge.renameLimit from 10s to 60s, I found the following timings using the simple script at the end of this commit message (on an AWS c5.xlarge which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"):

N   Timing
0    1.995s
0   59.973s

So let's round down to nice even numbers and bump the limits from 400->1000, and from 1000->7000.

Here is the measure_rename_perf script (adapted from https://lore.kernel.org/git/[email protected]/ in particular to avoid triggering the linear handling from basename-guided rename detection):

#!/bin/bash

n=$1; shift

rm -rf repo
mkdir repo && cd repo
git init -q -b main

mkdata() {
  mkdir $1
  for i in `seq 1 $2`; do
    (sed "s/^/$i /" <../sample
     echo tag: $1
    ) >$1/$i
  done
}

mkdata initial $n
git add .
git commit -q -m initial

mkdata new $n
git add .
cd new
for i in *; do git mv $i $i.renamed; done
cd ..
git rm -q -rf initial
git commit -q -m new

time git diff-tree -M -l0 --summary HEAD^ HEAD

git config now includes in its man page:

-l. If not set, the default value is currently 1000.

git config now includes in its man page:

currently defaults to 7000.


And, still with Git 2.33 (Q3 2021):

See commit 94b82d5, commit 9dd29db, commit 6623a52, commit 05d2c61 (15 Jul 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 268055b, 28 Jul 2021)

doc: clarify documentation for rename/copy limits

Signed-off-by: Elijah Newren

A few places in the docs implied that rename/copy detection is always quadratic or that all (unpaired) files were involved in the quadratic portion of rename/copy detection.
The following two commits each introduced an exception to this:

  • 9027f53 ("Do linear-time/space rename logic for exact renames", 2007-10-25, Git v1.5.4-rc0 -- merge)
  • bd24aa2 ("diffcore-rename: guide inexact rename detection based on basenames", 2021-02-14, Git v2.31.0-rc1 -- merge)

(As a side note, for copy detection, the basename guided inexact rename detection is turned off and the exact renames will only result in sources (without the dests) being removed from the set of files used in quadratic detection.
So, for copy detection, the documentation was closer to correct.)

Avoid implying that all files involved in rename/copy detection are subject to the full quadratic algorithm.
While at it, also note the default values for all these settings.

git config now includes in its man page:

The number of files to consider in the exhaustive portion of copy/rename detection; equivalent to the 'git diff' option -l.
If not set, the default value is currently 400.
This setting has no effect if rename detection is turned off.

git config now includes in its man page:

The number of files to consider in the exhaustive portion of rename detection during a merge.

If not specified, defaults to the value of diff.renameLimit.
If neither merge.renameLimit nor diff.renameLimit are specified, currently defaults to 1000.
This setting has no effect if rename detection is turned off.

diff-options now includes in its man page:

The -M and -C options involve some preliminary steps that can detect subsets of renames/copies cheaply, followed by an exhaustive fallback portion that compares all remaining unpaired destinations to all relevant sources.
(For renames, only remaining unpaired sources are relevant; for copies, all original sources are relevant.)

For N sources and destinations, this exhaustive check is O(N^2).

This option prevents the exhaustive portion of rename/copy detection from running if the number of source/destination files involved exceeds the specified number.
Defaults to diff.renameLimit.

Solution 3:

There are many algorithms that detect similarities between texts, and version control systems often use these already to store only the difference between two versions. Tools like WinMerge are smart enough to detect differences, even within lines, so I don't see a reason why these algorithms would not be used for this rename detection.

Here is a discussion about algorithms to detect similar texts. Some of these algorithms might be optimized for natural languages, while others may work better for source code, but in essence they are very much alike.