How can I know if `git gc --auto` has done something?
Solution 1:
Update Sept. 2020: you no longer have to rely only on `git gc --auto` in your automatic save script.
The old `gc` can now be superseded by the new `git maintenance run --auto`.
And it can display what it is doing.
With Git 2.29 (Q4 2020), a big brother of `git gc` has been introduced to take care of more repository maintenance tasks, not limited to cleaning the object database.
See commit 25914c4, commit 4ddc79b, commit 916d062, commit 65d655b, commit d7514f6, commit 090511b, commit 663b2b1, commit 3103e98, commit a95ce12, commit 3ddaad0, commit 2057d75 (17 Sep 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 48794ac, 25 Sep 2020)
maintenance: create basic maintenance runner
Helped-by: Jonathan Nieder
Signed-off-by: Derrick Stolee
The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as:
- repacking the repository,
- packing refs, and
- rewriting the commit-graph file.
The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database.
Create a new 'maintenance' builtin that will become a more general-purpose command.
To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background.
For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin.
In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin.
The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin.
Use existing `builtin/gc.c` file because we want to share code between the two builtins.
It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'.
Create a new `test_subcommand` helper that allows us to test if a certain subcommand was run. It requires storing the `GIT_TRACE2_EVENT` logs in a file.
A negation mode is available that will be used in later tests.
(That last part is one way to ascertain that the new `git maintenance run --auto` does something.)
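The same idea works outside Git's test suite. The sketch below records a trace2 event log while running auto-maintenance and prints the argument list of every child process that was started. The function names `trace_maintenance` and `spawned_commands` are hypothetical helpers. The sketch assumes the JSON event format written through `GIT_TRACE2_EVENT`, where each spawned command shows up as a `child_start` event carrying an `argv` array.

```shell
# Print the argv of every child process recorded in a trace2 event log.
# Assumption: one JSON event per line, with "event":"child_start" and "argv".
spawned_commands() {
    grep '"event":"child_start"' "$1" |
        sed -n 's/.*"argv":\(\[[^]]*]\).*/\1/p'
}

# Run auto-maintenance in the current repository and report what it spawned.
# Empty output suggests that no task started a child command.
trace_maintenance() {
    log=$(mktemp) || return 1      # GIT_TRACE2_EVENT expects an absolute path
    GIT_TRACE2_EVENT="$log" git maintenance run --auto
    spawned_commands "$log"
    rm -f "$log"
}
```

Calling `trace_maintenance` from a save script, and checking whether it printed anything, gives a rough "did it do something" signal.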
`git maintenance` now includes in its man page:
git-maintenance(1)
NAME
git-maintenance - Run tasks to optimize Git repository data
SYNOPSIS
[verse] 'git maintenance' run [<options>]
DESCRIPTION
Run tasks to optimize Git repository data, speeding up other Git commands and reducing storage requirements for the repository.
Git commands that add repository data, such as `git add` or `git fetch`, are optimized for a responsive user experience. These commands do not take time to optimize the Git data, since such optimizations scale with the full size of the repository while these user commands each perform a relatively small action.
The `git maintenance` command provides flexibility for how to optimize the Git repository.
SUBCOMMANDS
run
Run one or more maintenance tasks.
TASKS
gc
Clean up unnecessary files and optimize the local repository. "GC" stands for "garbage collection," but this task performs many smaller tasks. This task can be expensive for large repositories, as it repacks all Git objects into a single pack-file. It can also be disruptive in some situations, as it deletes stale data. See `git gc` for more details on garbage collection in Git.
OPTIONS
--auto
When combined with the `run` subcommand, run maintenance tasks only if certain thresholds are met. For example, the `gc` task runs when the number of loose objects exceeds the number stored in the `gc.auto` config setting, or when the number of pack-files exceeds the `gc.autoPackLimit` config setting.
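To get a rough idea of whether those thresholds are exceeded before running anything, a script can compare the figures from `git count-objects -v` with the two settings. This is only a sketch: the helper names are hypothetical, the fallback values 6700 and 50 are the documented defaults of `gc.auto` and `gc.autoPackLimit`, and Git's own check may estimate the loose-object count rather than count exactly, so the result is an approximation.

```shell
# Succeeds when the counts exceed the configured gc thresholds.
# Args: loose-object count, pack count, gc.auto, gc.autoPackLimit
gc_thresholds_exceeded() {
    loose=$1 packs=$2 auto=$3 pack_limit=$4
    [ "$auto" -gt 0 ] || return 1            # gc.auto of 0 disables auto gc
    [ "$loose" -gt "$auto" ] && return 0     # too many loose objects
    [ "$pack_limit" -gt 0 ] && [ "$packs" -gt "$pack_limit" ]
}

# Feed the check with figures from the current repository.
gc_auto_likely() {
    stats=$(git count-objects -v) || return 2
    gc_thresholds_exceeded \
        "$(printf '%s\n' "$stats" | awk '/^count:/ { print $2 }')" \
        "$(printf '%s\n' "$stats" | awk '/^packs:/ { print $2 }')" \
        "$(git config --get gc.auto || echo 6700)" \
        "$(git config --get gc.autoPackLimit || echo 50)"
}
```

`gc_auto_likely && echo "gc --auto would probably run"` is one way to use it in a save script.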
maintenance: replace run_auto_gc()
Signed-off-by: Derrick Stolee
The `run_auto_gc()` method is used in several places to trigger a check for repo maintenance after some Git commands, such as 'git commit' or 'git fetch'.
To allow for extra customization of this maintenance activity, replace the 'git gc --auto [--quiet]' call with one to 'git maintenance run --auto [--quiet]'.
As we extend the maintenance builtin with other steps, users will be able to select different maintenance activities.
Rename `run_auto_gc()` to `run_auto_maintenance()` to be clearer what is happening on this call, and to expose all callers in the current diff. Rewrite the method to use a struct `child_process` to simplify the calls slightly.
Since 'git fetch' already allows disabling the 'git gc --auto' subprocess, add an equivalent option with a different name to be more descriptive of the new behavior: '--[no-]maintenance'.
`fetch-options` now includes in its man page:
Run `git maintenance run --auto` at the end to perform automatic repository maintenance if needed. (`--[no-]auto-gc` is a synonym.)
This is enabled by default.
`git clone` now includes in its man page:
which automatically call `git maintenance run --auto`. (See `git maintenance`.)
Plus, your save script will be able to make `git maintenance` do more than `git gc` ever could, thanks to tasks.
maintenance: add --task option
Signed-off-by: Derrick Stolee
A user may want to only run certain maintenance tasks in a certain order.
Add the `--task=<task>` option, which allows a user to specify an ordered list of tasks to run. These cannot be run multiple times, however.
Here is where our array of `maintenance_task` pointers becomes critical. We can sort the array of pointers based on the task order, but we do not want to move the struct data itself in order to preserve the hashmap references. We use the hashmap to match the --task= arguments into the task struct data.
Keep in mind that the 'enabled' member of the `maintenance_task` struct is a placeholder for a future 'maintenance.<task>.enabled' config option. Thus, we use the 'enabled' member to specify which tasks are run when the user does not specify any `--task=<task>` arguments.
The 'enabled' member should be ignored if `--task=<task>` appears.
`git maintenance` now includes in its man page:
Run one or more maintenance tasks. If one or more `--task=<task>` options are specified, then those tasks are run in the provided order. Otherwise, only the `gc` task is run.
`git maintenance` now includes in its man page:
--task=<task>
If this option is specified one or more times, then only run the specified tasks in the specified order. See the 'TASKS' section for the list of accepted `<task>` values.
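As a small illustration of the ordered `--task` list, the hypothetical helper `run_tasks` below turns its arguments into repeated `--task=<task>` options while preserving their order. The task names in the usage comment are the ones described in this answer.

```shell
# Run the named maintenance tasks, in the order given.
# Usage: run_tasks commit-graph gc
run_tasks() {
    for task in "$@"; do
        # Append the option form and drop the bare name it replaces.
        set -- "$@" "--task=$task"
        shift
    done
    git maintenance run "$@"
}
```

Called without arguments, it falls back to a plain `git maintenance run`, which runs whichever tasks are enabled.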
And:
maintenance: create maintenance.<task>.enabled config
Signed-off-by: Derrick Stolee
Currently, a normal run of "git maintenance run" will only run the 'gc' task, as it is the only one enabled.
This is mostly for backwards-compatible reasons since "git maintenance run --auto" commands replaced previous "git gc --auto" commands after some Git processes.
Users could manually run specific maintenance tasks by calling "git maintenance run --task=<task>" directly.
Allow users to customize which steps are run automatically using config. The 'maintenance.<task>.enabled' option then can turn on these other tasks (or turn off the 'gc' task).
`git config` now includes in its man page:
maintenance.<task>.enabled
This boolean config option controls whether the maintenance task with name `<task>` is run when no `--task` option is specified to `git maintenance run`. These config values are ignored if a `--task` option exists.
By default, only `maintenance.gc.enabled` is true.
`git maintenance` now includes in its man page:
Run one or more maintenance tasks. If one or more `--task` options are specified, then those tasks are run in that order. Otherwise, the tasks are determined by which `maintenance.<task>.enabled` config options are true.
By default, only `maintenance.gc.enabled` is true.
`git maintenance` now also includes in its man page:
If no `--task=<task>` arguments are specified, then only the tasks with `maintenance.<task>.enabled` configured as `true` are considered.
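In a script, switching tasks on or off comes down to setting those config keys. The two helpers below are a hypothetical sketch built on plain `git config` calls; they only touch the repository-level configuration of the current directory.

```shell
# Turn a maintenance task on or off for a plain "git maintenance run".
# Usage: set_task_enabled commit-graph true
set_task_enabled() {
    git config "maintenance.$1.enabled" "$2"
}

# List the tasks whose enabled flag is set explicitly in config.
explicit_task_settings() {
    git config --get-regexp '^maintenance\..*\.enabled$'
}
```

For example, `set_task_enabled commit-graph true` followed by `set_task_enabled gc false` would make an option-less `git maintenance run` write the commit-graph and skip `gc`.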
Another way to know if the new `git maintenance run` is currently doing anything is to check for a lock (`.git/maintenance.lock` file):
maintenance: take a lock on the objects directory
Signed-off-by: Derrick Stolee
Performing maintenance on a Git repository involves writing data to the `.git` directory, which is not safe to do with multiple writers attempting the same operation.
Ensure that only one 'git maintenance' process is running at a time by holding a file-based lock.
Simply the presence of the `.git/maintenance.lock` file will prevent future maintenance. This lock is never committed, since it does not represent meaningful data. Instead, it is only a placeholder.
If the lock file already exists, then no maintenance tasks are attempted. This will become very important later when we implement the 'prefetch' task, as this is our stop-gap from creating a recursive process loop between 'git fetch' and 'git maintenance run --auto'.
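A script can test for that lock before starting its own maintenance. This is a minimal sketch: `maintenance_lock_held` is a hypothetical helper, and since the commit message names `.git/maintenance.lock` while its title speaks of the objects directory, the helper checks both `.git/maintenance.lock` and `.git/objects/maintenance.lock`. The second location is an assumption.

```shell
# Succeeds if a maintenance lock file is present.
# Arg: path to the repository's .git directory (defaults to ".git").
maintenance_lock_held() {
    git_dir=${1:-.git}
    [ -e "$git_dir/objects/maintenance.lock" ] ||
        [ -e "$git_dir/maintenance.lock" ]
}
```

The check is racy by nature: the lock can appear or vanish right after the test, so treat the answer as a hint rather than a guarantee.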
You can also check if `git gc`/`git maintenance` will have to do anything.
maintenance: use pointers to check --auto
Signed-off-by: Derrick Stolee
The 'git maintenance run' command has an '--auto' option. This is used by other Git commands such as 'git commit' or 'git fetch' to check if maintenance should be run after adding data to the repository.
Previously, this `--auto` option was only used to add the argument to the 'git gc' command as part of the 'gc' task.
We will be expanding the other tasks to perform a check to see if they should do work as part of the `--auto` flag, when they are enabled by config.
First, update the 'gc' task to perform the auto check inside the maintenance process.
This prevents running an extra 'git gc --auto' command when not needed.
It also shows a model for other tasks.
Second, use the 'auto_condition' function pointer as a signal for whether we enable the maintenance task under '--auto'.
For instance, we do not want to enable the 'fetch' task in '--auto' mode, so that function pointer will remain `NULL`.
We continue to pass the '--auto' option to the 'git gc' command when necessary, because of the `gc.autoDetach` config option changes behavior.
Likely, we will want to absorb the daemonizing behavior implied by `gc.autoDetach` as a `maintenance.autoDetach` config option.
To illustrate what `git maintenance` will do that `git gc` won't:
maintenance: add commit-graph task
Signed-off-by: Derrick Stolee
The first new task in the 'git maintenance' builtin is the 'commit-graph' task.
This updates the commit-graph file incrementally with the command `git commit-graph write --reachable --split`.
By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation.
The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution.
Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will very likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.)
If a concurrent process reads the old commit-graph-chain file, but our job expires some of the `.graph` files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the `--expire-time` argument when writing the commit-graph.
`git maintenance` now includes in its man page:
commit-graph
The `commit-graph` job updates the `commit-graph` files incrementally, then verifies that the written data is correct.
The incremental write is safe to run alongside concurrent Git processes since it will not expire `.graph` files that were in the previous `commit-graph-chain` file. They will be deleted by a later run based on the expiration delay.
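One way to observe whether the `commit-graph` task changed anything is to compare the number of layers in the chain before and after a run. The helper below is a hypothetical sketch; it assumes the chain file sits at `objects/info/commit-graphs/commit-graph-chain` inside the `.git` directory and lists one layer per line.

```shell
# Print how many layers the split commit-graph currently has (0 if none).
# Arg: path to the repository's .git directory (defaults to ".git").
commit_graph_layers() {
    chain="${1:-.git}/objects/info/commit-graphs/commit-graph-chain"
    if [ -f "$chain" ]; then
        awk 'END { print NR }' "$chain"
    else
        echo 0
    fi
}
```

Because layers are merged from time to time, an unchanged count does not prove that nothing was written; comparing the file's modification time as well would be a reasonable refinement.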
And:
maintenance: add auto condition for commit-graph task
Signed-off-by: Derrick Stolee
Instead of writing a new commit-graph in every 'git maintenance run --auto' process (when `maintenance.commit-graph.enabled` is configured to be `true`), only write when there are "enough" commits not in a commit-graph file.
This count is controlled by the `maintenance.commit-graph.auto` config option.
To compute the count, use a depth-first search starting at each ref, and leaving markers using the `SEEN` flag.
If this count reaches the limit, then terminate early and start the task.
Otherwise, this operation will peel every ref and parse the commit it points to. If these are all in the commit-graph, then this is typically a very fast operation.
Users with many refs might feel a slow-down, and hence could consider updating their limit to be very small. A negative value will force the step to run every time.
`git config` now includes in its man page:
maintenance.commit-graph.auto
This integer config option controls how often the `commit-graph` task should be run as part of `git maintenance run --auto`.
- If zero, then the `commit-graph` task will not run with the `--auto` option.
- A negative value will force the task to run every time.
- Otherwise, a positive value implies the command should run when the number of reachable commits that are not in the commit-graph file is at least the value of `maintenance.commit-graph.auto`.
The default value is 100.
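The three cases described for this setting can be written down as a tiny decision function. This is a sketch of the documented rule only, not Git's code: `auto_condition_met` is a hypothetical name, and the caller has to supply the number of pending items (here, reachable commits missing from the commit-graph), which this helper does not compute.

```shell
# Succeeds when a task should run under --auto.
# Args: value of maintenance.<task>.auto, number of pending items
auto_condition_met() {
    limit=$1 pending=$2
    if [ "$limit" -eq 0 ]; then return 1; fi   # zero: never run under --auto
    if [ "$limit" -lt 0 ]; then return 0; fi   # negative: run every time
    [ "$pending" -ge "$limit" ]                # positive: run at or above the limit
}
```

With the default of 100, `auto_condition_met 100 99` fails and `auto_condition_met 100 100` succeeds, which reflects the "at least" wording of the man page.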
With Git 2.30 (Q1 2021), enhancing the test coverage by running the `commit-graph` task of `git maintenance` as needed led to the discovery and fix of a bug.
See commit d334107 (12 Oct 2020), and commit 8f80180 (08 Oct 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 0be2d65, 02 Nov 2020)
maintenance: test commit-graph auto condition
Signed-off-by: Derrick Stolee
The auto condition for the `commit-graph` maintenance task walks refs looking for commits that are not in the `commit-graph` file.
This was added in 4ddc79b2 ("maintenance: add auto condition for commit-graph task", 2020-09-17, Git v2.29.0-rc0 -- merge listed in batch #17) but was left untested.
The initial goal of this change was to demonstrate the feature works properly by adding tests. However, there was an off-by-one error that caused the basic tests around `maintenance.commit-graph.auto=1` to fail when it should work.
The subtlety is that if a ref tip is not in the `commit-graph`, then we were not adding that to the total count. In the test, we see that we have only added one commit since our last commit-graph write, so the auto condition would say there is nothing to do.
The fix is simple: add the check for the `commit-graph` position to see that the tip is not in the `commit-graph` file before starting our walk. Since this happens before adding to the DFS stack, we do not need to clear our (currently empty) commit list.
This does add some extra complexity for the test, because we also want to verify that the walk along the parents actually does some work. This means we need to add at least two commits in a row without writing the `commit-graph`. However, we also need to make sure no additional refs are pointing to the middle of this list or else the `for_each_ref()` in `should_write_commit_graph()` might visit these commits as tips instead of doing a DFS walk. Hence, the last two commits are added with "git commit" instead of "test_commit".
With Git 2.30 (Q1 2021), `git maintenance run/start/stop` needed to be run in a repository to hold the lockfile they use, but did not make sure they were actually in a repository; this has been corrected.
See commit 0a1f2d0 (08 Dec 2020) by Josh Steadmon (steadmon).
See commit e72f7de (26 Nov 2020) by Rafael Silva (raffs).
(Merged by Junio C Hamano -- gitster -- in commit f2a75cb, 08 Dec 2020)
maintenance: fix SEGFAULT when no repository
Signed-off-by: Rafael Silva
Reviewed-by: Derrick Stolee
The "git maintenance run" and "git maintenance start/stop" commands holds a file-based lock at the `.git/maintenance.lock` and `.git/schedule.lock` respectively. These locks are used to ensure only one maintenance process is executed at the time as both operations involves writing data into the repository.
The path to the lock file is built using "the_repository->objects->odb->path" that results in SEGFAULT when we have no repository available as "the_repository->objects->odb" is set to `NULL`.
Let's teach maintenance command to use `RUN_SETUP` option that will provide the validation and fail when running outside of a repository. Hence fixing the SEGFAULT for all three operations and making the behaviour consistent across all subcommands.
Setting the `RUN_SETUP` also provides the same protection for all subcommands given that the "register" and "unregister" also requires to be executed inside a repository.
Furthermore let's remove the local validation implemented by the "register" and "unregister" as this will not be required anymore with the new option.
Solution 2:
With Git 2.30 (Q1 2021), `git maintenance`, the extended big brother of `git gc` presented in the previous answer, continues to evolve.
It is more precise than `git gc`, and the options introduced in 2.30 make it possible to know when it has done something, as asked in the OP.
See commit e841a79, commit a13e3d0, commit 52fe41f, commit efdd2f0, commit 18e449f, commit 3e220e6, commit 252cfb7, commit 28cb5e6 (25 Sep 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 52b8c8c, 27 Oct 2020)
maintenance: add incremental-repack task
Signed-off-by: Derrick Stolee
The previous change cleaned up loose objects using the 'loose-objects' that can be run safely in the background. Add a similar job that performs similar cleanups for pack-files.
One issue with running 'git repack' is that it is designed to repack all pack-files into a single pack-file. While this is the most space-efficient way to store object data, it is not time or memory efficient. This becomes extremely important if the repo is so large that a user struggles to store two copies of the pack on their disk.
Instead, perform an "incremental" repack by collecting a few small pack-files into a new pack-file. The multi-pack-index facilitates this process ever since 'git multi-pack-index expire' was added in 19575c7 ("multi-pack-index: implement 'expire' subcommand", 2019-06-10, Git v2.23.0-rc0 -- merge listed in batch #6) and 'git multi-pack-index repack' was added in ce1e4a1 ("midx: implement midx_repack()", 2019-06-10, Git v2.23.0-rc0 -- merge listed in batch #6).
The 'incremental-repack' task runs the following steps:
- 'git multi-pack-index write' creates a multi-pack-index file if one did not exist, and otherwise will update the multi-pack-index with any new pack-files that appeared since the last write. This is particularly relevant with the background fetch job. When the multi-pack-index sees two copies of the same object, it stores the offset data into the newer pack-file. This means that some old pack-files could become "unreferenced" which I will use to mean "a pack-file that is in the pack-file list of the multi-pack-index but none of the objects in the multi-pack-index reference a location inside that pack-file."
- 'git multi-pack-index expire' deletes any unreferenced pack-files and updates the multi-pack-index to drop those pack-files from the list. This is safe to do as concurrent Git processes will see the multi-pack-index and not open those packs when looking for object contents. (Similar to the 'loose-objects' job, there are some Git commands that open pack-files regardless of the multi-pack-index, but they are rarely used. Further, a user that self-selects to use background operations would likely refrain from using those commands.)
- 'git multi-pack-index repack --batch-size=<size>' collects a set of pack-files that are listed in the multi-pack-index and creates a new pack-file containing the objects whose offsets are listed by the multi-pack-index to be in those objects. The set of pack-files is selected greedily by sorting the pack-files by modified time and adding a pack-file to the set if its "expected size" is smaller than the batch size until the total expected size of the selected pack-files is at least the batch size. The "expected size" is calculated by taking the size of the pack-file divided by the number of objects in the pack-file and multiplied by the number of objects from the multi-pack-index with offset in that pack-file. The expected size approximates how much data from that pack-file will contribute to the resulting pack-file size. The intention is that the resulting pack-file will be close in size to the provided batch size. The next run of the incremental-repack task will delete these repacked pack-files during the 'expire' step.
In this version, the batch size is set to "0" which ignores the size restrictions when selecting the pack-files. It instead selects all pack-files and repacks all packed objects into a single pack-file. This will be updated in the next change, but it requires doing some calculations that are better isolated to a separate change.
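The greedy selection described for the repack step can be sketched in a few lines of shell. This is an illustration of the rule as worded in the commit message, not Git's implementation: `select_batch` is a hypothetical helper, its input format is invented for the example, and the caller is assumed to supply the lines already sorted by modification time.

```shell
# Read "name pack_size num_objects referenced_objects" lines from stdin,
# already sorted by modification time, and print the packs chosen for
# one batch. A batch size of 0 selects every pack.
select_batch() {
    batch=$1 total=0
    while read -r name size objects referenced; do
        if [ "$batch" -eq 0 ]; then
            echo "$name"
            continue
        fi
        if [ "$total" -ge "$batch" ]; then
            break                              # the batch is full
        fi
        # Expected size: share of the pack still referenced by the index.
        # Multiply before dividing to keep integer precision.
        expected=$(( size * referenced / objects ))
        if [ "$expected" -lt "$batch" ]; then
            total=$(( total + expected ))
            echo "$name"
        fi
    done
}
```

For instance, with a batch size of 3000, packs whose expected sizes are 1000, 9000, 2000 and 500 (in that order) yield the first and third pack: the second is too large on its own, and the batch is full before the fourth is considered.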
These steps are based on a similar background maintenance step in Scalar (and VFS for Git). This was incredibly effective for users of the Windows OS repository. After using the same VFS for Git repository for over a year, some users had thousands of pack-files that combined to up to 250 GB of data. We noticed a few users were running into the open file descriptor limits (due in part to a bug in the multi-pack-index fixed by af96fe3 ("midx: add packs to packed_git linked list", 2019-04-29, Git v2.22.0-rc1 -- merge)).
These pack-files were mostly small since they contained the commits and trees that were pushed to the origin in a given hour. The GVFS protocol includes a "prefetch" step that asks for pre-computed pack-files containing commits and trees by timestamp. These pack-files were grouped into "daily" pack-files once a day for up to 30 days. If a user did not request prefetch packs for over 30 days, then they would get the entire history of commits and trees in a new, large pack-file. This led to a large number of pack-files that had poor delta compression.
By running this pack-file maintenance step once per day, these repos with thousands of packs spanning 200+ GB dropped to dozens of pack-files spanning 30-50 GB. This was done all without removing objects from the system and using a constant batch size of two gigabytes. Once the work was done to reduce the pack-files to small sizes, the batch size of two gigabytes means that not every run triggers a repack operation, so the following run will not expire a pack-file. This has kept these repos in a "clean" state.
`git maintenance` now includes in its man page:
incremental-repack
The `incremental-repack` job repacks the object directory using the `multi-pack-index` feature. In order to prevent race conditions with concurrent Git commands, it follows a two-step process. First, it calls `git multi-pack-index expire` to delete pack-files unreferenced by the `multi-pack-index` file. Second, it calls `git multi-pack-index repack` to select several small pack-files and repack them into a bigger one, and then update the `multi-pack-index` entries that refer to the small pack-files to refer to the new pack-file. This prepares those small pack-files for deletion upon the next run of `git multi-pack-index expire`. The selection of the small pack-files is such that the expected size of the big pack-file is at least the batch size; see the `--batch-size` option for the `repack` subcommand in `git multi-pack-index`. The default batch-size is zero, which is a special case that attempts to repack all pack-files into a single pack-file.
And:
maintenance: add incremental-repack auto condition
Signed-off-by: Derrick Stolee
The incremental-repack task updates the multi-pack-index by deleting pack-files that have been replaced with new packs, then repacking a batch of small pack-files into a larger pack-file. This incremental repack is faster than rewriting all object data, but is slower than some other maintenance activities.
The 'maintenance.incremental-repack.auto' config option specifies how many pack-files should exist outside of the multi-pack-index before running the step.
These pack-files could be created by 'git fetch' commands or by the loose-objects task.
The default value is 10.
Setting the option to zero disables the task with the '--auto' option, and a negative value makes the task run every time.
`git config` now includes in its man page:
maintenance.incremental-repack.auto
This integer config option controls how often the `incremental-repack` task should be run as part of `git maintenance run --auto`. If zero, then the `incremental-repack` task will not run with the `--auto` option. A negative value will force the task to run every time. Otherwise, a positive value implies the command should run when the number of pack-files not in the multi-pack-index is at least the value of `maintenance.incremental-repack.auto`. The default value is 10.
Git 2.30 (Q1 2021) adds parts of `git maintenance` that ease writing crontab entries (and other scheduling system configuration) for it.
See commit 0016b61, commit 61f7a38, commit a4cb1a2 (15 Oct 2020), commit 2fec604, commit 0c18b70, commit 4950b2a, commit b08ff1f (11 Sep 2020), and commit 1942d48 (28 Aug 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 7660da1, 18 Nov 2020)
maintenance: add troubleshooting guide to docs
Helped-by: Junio C Hamano
Signed-off-by: Derrick Stolee
The 'git maintenance run' subcommand takes a lock on the object database to prevent concurrent processes from competing for resources. This is an important safety measure to prevent possible repository corruption and data loss.
This feature can lead to confusing behavior if a user is not aware of it. Add a TROUBLESHOOTING section to the 'git maintenance' builtin documentation that discusses these tradeoffs.
The short version of this section is that Git will not corrupt your repository, but if the list of scheduled tasks takes longer than an hour then some scheduled tasks may be dropped due to this object database collision.
For example, a long-running "daily" task at midnight might prevent an "hourly" task from running at 1AM.
The opposite is also possible, but less likely as long as the "hourly" tasks are much faster than the "daily" and "weekly" tasks.
`git maintenance` now includes in its man page:
TROUBLESHOOTING
The `git maintenance` command is designed to simplify the repository maintenance patterns while minimizing user wait time during Git commands. A variety of configuration options are available to allow customizing this process. The default maintenance options focus on operations that complete quickly, even on large repositories.
Users may find some cases where scheduled maintenance tasks do not run as frequently as intended. Each `git maintenance run` command takes a lock on the repository's object database, and this prevents other concurrent `git maintenance run` commands from running on the same repository. Without this safeguard, competing processes could leave the repository in an unpredictable state.
The background maintenance schedule runs `git maintenance run` processes on an hourly basis. Each run executes the "hourly" tasks. At midnight, that process also executes the "daily" tasks. At midnight on the first day of the week, that process also executes the "weekly" tasks. A single process iterates over each registered repository, performing the scheduled tasks for that frequency. Depending on the number of registered repositories and their sizes, this process may take longer than an hour. In this case, multiple `git maintenance run` commands may run on the same repository at the same time, colliding on the object database lock. This results in one of the two tasks not running.
If you find that some maintenance windows are taking longer than one hour to complete, then consider reducing the complexity of your maintenance tasks. For example, the `gc` task is much slower than the `incremental-repack` task. However, this comes at a cost of a slightly larger object database. Consider moving more expensive tasks to be run less frequently.
Expert users may consider scheduling their own maintenance tasks using a different schedule than is available through `git maintenance start` and Git configuration options. These users should be aware of the object database lock and how concurrent `git maintenance run` commands behave. Further, the `git gc` command should not be combined with `git maintenance run` commands. `git gc` modifies the object database but does not take the lock in the same way as `git maintenance run`. If possible, use `git maintenance run --task=gc` instead of `git gc`.
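For a home-grown schedule, that last recommendation could look like the sketch below. `gc_with_lock` is a hypothetical helper and the repository paths in the usage comment are placeholders; the only Git invocation it relies on is the `git maintenance run --task=gc` form recommended above.

```shell
# Run the gc task through "git maintenance" in each repository given,
# so that the object database lock is respected.
# Usage: gc_with_lock ~/src/project-a ~/src/project-b
gc_with_lock() {
    for repo in "$@"; do
        git -C "$repo" maintenance run --task=gc ||
            echo "maintenance failed in $repo" >&2
    done
}
```

Running the repositories one after another, rather than in parallel, also keeps a slow `gc` in one repository from colliding with another run on the same repository.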