What algorithm does git use to detect changes on your working tree?
Git’s index records, for each file, the timestamps from when Git last wrote it into the working tree (and refreshes them whenever a file is staged from the working tree or checked out from a commit). You can see this metadata with git ls-files --debug. In addition to the timestamps, it records the size, inode, and other information from lstat to reduce the chance of a false positive.
When you run git status, it simply calls lstat on every file in the working tree and compares the result with the cached metadata to quickly determine which files are unchanged. This is described in the documentation under racy-git and update-index.
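To make the comparison concrete, here is a minimal sketch in Python, whose os.lstat wraps the same system call. CachedStat, snapshot, and looks_unchanged are hypothetical names chosen for illustration, and the set of fields is an assumption; this is not Git's actual code or its exact list of compared fields.

```python
import os
from typing import NamedTuple


class CachedStat(NamedTuple):
    """Hypothetical stand-in for the stat data kept in an index entry."""
    mtime_ns: int
    ctime_ns: int
    size: int
    ino: int
    mode: int


def snapshot(path: str) -> CachedStat:
    """Record the lstat fields we want to remember for `path`."""
    st = os.lstat(path)
    return CachedStat(st.st_mtime_ns, st.st_ctime_ns,
                      st.st_size, st.st_ino, st.st_mode)


def looks_unchanged(path: str, cached: CachedStat) -> bool:
    """Cheap check: True if the current lstat data equals the cached data.

    No file contents are read, which is what makes the check fast.
    """
    return snapshot(path) == cached
```

Calling snapshot(path) when a file is staged and looks_unchanged(path, cached) later mirrors the idea: a mismatch proves nothing about the contents by itself, but a match lets the expensive content comparison be skipped.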
On a Unix file system, this per-file information is tracked by the system and can be accessed with the lstat call. The stat structure contains several timestamps, size information, and more:
struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
It seems that initially Git simply relied on this stat structure to decide if a file had been changed (see reference):
When checking if they differ, Git first runs lstat(2) on the files and compares the result with this information
However, a race condition was reported (see racy-git) that occurs when a file is modified in the following manner:
: modify 'foo'
$ git update-index 'foo'
: modify 'foo' again, in-place, without changing its size
(And quickly enough not to change its timestamps)
This leaves the file modified, but in a way that lstat alone cannot detect.
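To fix this, Git now falls back to comparing file contents whenever the lstat data is ambiguous, that is, when an entry's cached modification time is the same as or later than the timestamp of the index file itself, so that an edit made within the same timestamp tick could have gone unnoticed.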
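The fallback can be sketched as follows. This is a simplified model under stated assumptions, not Git's implementation: Entry, content_id, and is_modified are hypothetical names, timestamps are truncated to whole seconds to model a coarse clock, and a plain SHA-1 of the file bytes stands in for Git's object ID.

```python
import hashlib
import os


def content_id(path: str) -> str:
    """Plain SHA-1 of the file bytes (a stand-in, not Git's blob hash)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


class Entry:
    """Hypothetical index entry: cached stat data plus a content hash."""

    def __init__(self, path: str):
        st = os.lstat(path)
        self.path = path
        self.mtime = int(st.st_mtime)  # whole seconds: coarse timestamps
        self.size = st.st_size
        self.digest = content_id(path)


def is_modified(entry: Entry, index_mtime: int) -> bool:
    """Decide whether the file behind `entry` has changed.

    `index_mtime` is the time (in seconds) the index itself was written.
    """
    st = os.lstat(entry.path)
    if (int(st.st_mtime), st.st_size) != (entry.mtime, entry.size):
        return True  # stat data differs, so the file was touched
    if entry.mtime >= index_mtime:
        # Ambiguous ("racily clean") case: the file may have been rewritten
        # in the same second the index was written, so stat cannot be
        # trusted and the contents must be compared.
        return content_id(entry.path) != entry.digest
    return False  # stat matches and the entry predates the index: trust it
```

The design point is that the expensive content comparison runs only for the small set of entries whose timestamps are too close to the index's own to be trusted; everything else is decided from lstat alone.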
NOTE:
If anyone is confused, like I was, by the st_mtime description, which states that it is updated by writes "of more than zero bytes": this refers to the number of bytes written, not to the net change in file size.
For example, take a text file containing the single character A: if A is changed to B, there is 0 net change in total byte size, but st_mtime will still be updated (I had to try it myself to verify; use ls -l to see the timestamp).
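The same experiment can be reproduced with a short script. This is only a sketch: same_size_overwrite is a hypothetical helper, and it backdates the file with os.utime first (an assumption made so the change is visible regardless of the file system's timestamp granularity).

```python
import os


def same_size_overwrite(path: str):
    """Replace the single byte A with B in place; return lstat before/after."""
    with open(path, "wb") as f:
        f.write(b"A")
    # Backdate the file so the later write is clearly distinguishable.
    os.utime(path, (1_000_000_000, 1_000_000_000))
    before = os.lstat(path)

    with open(path, "r+b") as f:  # open for in-place update, no truncation
        f.write(b"B")             # one byte written, file size unchanged
    after = os.lstat(path)
    return before, after
```

Comparing before.st_size with after.st_size and before.st_mtime_ns with after.st_mtime_ns shows the size staying at one byte while the modification time moves forward.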