How to update a git shallow clone?

Background

(for tl;dr, see #questions below)

I have multiple git repository shallow clones. I'm using shallow clones because they are a lot smaller than full clones. Each was created with something like git clone --single-branch --depth 1 <git-repo-url> <dir-name>.

This works fine, except I don't see how to update it.

When I'm cloning by a tag, update is not meaningful, as a tag is a frozen point in time (as I understand it). In this case, if I want to update, this means I want to clone by another tag, so I just rm -rf <dir-name> and clone again.

Things get more complicated when I’ve cloned the HEAD of a master branch then later want to update it.

I tried git pull --depth 1, but although I'm not going to push anything to the remote repository, it complains that it doesn't know who I am.

I tried git fetch --depth 1, but although it seems to update something, I checked it is not up to date (some files on the remote repository have a different content than the ones on my clone).

After https://stackoverflow.com/a/20508591/279335, I tried git fetch --depth 1; git reset --hard origin/master, but two things: first, I don't understand why git reset is needed; second, although the files seem to be up to date, some old files remain, and git clean -df does not delete these files.

Questions

Given a clone created with git clone --single-branch --depth 1 <git-repo-url> <dir-name>, how can I update it to achieve the same result as rm -rf <dir-name>; git clone --single-branch --depth 1 <git-repo-url> <dir-name>? Or is rm -rf <dir-name> followed by a fresh clone the only way?

Note

This is not a duplicate of How to update a shallow cloned submodule without increasing main repo size, as the answer does not fulfil my expectations and I'm using simple repositories, not submodules (which I don't know about).


TL;DR

Given that you have an existing --depth 1 repository cloned from branch B and you'd like Git to act as if you removed and re-cloned, you can use this sequence of commands:

git fetch --depth 1
git reset --hard origin/B
git clean -dfx

(e.g., git reset --hard origin/master—I cannot put italics in the code-literal section above). You should be able to do the git clean step at any point before or after the other two commands, but the git reset must come after the git fetch.
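As a minimal sketch, the three commands can be wrapped in a small helper and tried against a throwaway repository. The function name update_shallow, the temporary directory layout, and the demo identity are all assumptions made for illustration; a local file:// repository stands in for the real upstream.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not a Git command): refresh a depth-1 clone in place.
# $1 = clone directory, $2 = branch name (defaults to master).
update_shallow() {
    git -C "$1" fetch --quiet --depth 1
    git -C "$1" reset --quiet --hard "origin/${2:-master}"
    git -C "$1" clean --quiet -dfx
}

# Throwaway demonstration: a local repository stands in for the real upstream.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
echo one > "$tmp/upstream/file.txt"
git -C "$tmp/upstream" add file.txt
git -C "$tmp/upstream" commit --quiet -m first
git clone --quiet --single-branch --depth 1 "file://$tmp/upstream" "$tmp/clone"

# Upstream moves on: the file is renamed and edited in a second commit.
git -C "$tmp/upstream" mv file.txt renamed.txt
echo two >> "$tmp/upstream/renamed.txt"
git -C "$tmp/upstream" commit --quiet -a -m second
: > "$tmp/clone/stray.o"    # an untracked leftover, e.g. from a build

update_shallow "$tmp/clone" master

upstream_head=$(git -C "$tmp/upstream" rev-parse HEAD)
clone_head=$(git -C "$tmp/clone" rev-parse HEAD)
clone_depth=$(git -C "$tmp/clone" rev-list --count HEAD)
echo "upstream=$upstream_head clone=$clone_head visible commits=$clone_depth"
```

The file:// URL matters in the demonstration because Git ignores --depth when cloning from a plain local path.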

Long

[slightly reworded and formatted] Given a clone created with git clone --single-branch --depth 1 url directory, how can I update it to achieve the same result as rm -rf directory; git clone --single-branch --depth 1 url directory?

Note that --single-branch is the default when using --depth 1. The (single) branch is the one you give with -b. There's a long aside that goes here about using -b with tags but I will leave that for later. If you don't use -b, your Git asks the "upstream" Git—the Git at url—which branch it has checked-out, and pretends you used -b thatbranch. This means that it is important to be careful when using --single-branch without -b to make sure that this upstream repository's current branch is sensible, and of course, when you do use -b, to make sure that the branch argument you give really does name a branch, not a tag.

The simple answer is basically this one, with two slight changes:

After https://stackoverflow.com/a/20508591/279335, I tried git fetch --depth 1; git reset --hard origin/master, but two things: first, I don't understand why git reset is needed; second, although the files seem to be up to date, some old files remain, and git clean -df does not delete these files.

The two slight changes are: make sure you use origin/branchname instead, and add -x (git clean -d -f -x or git clean -dfx) to the git clean step. As for why, that gets a bit more complicated.

What's going on

Without --depth 1, the git fetch step calls up the other Git and gets from it a list of branch names and corresponding commit hash IDs. That is, it finds a list of all the upstream's branches and their current commits. Then, because you have a --single-branch repository, your Git throws out all but the single branch, and brings over everything Git needs to connect that current commit back to the commit(s) you already have in your repository.

With --depth 1, your Git doesn't bother connecting the new commit to older historical commits at all. Instead, it obtains just the one commit and the other Git objects needed to complete that one commit. It then writes an additional "shallow graft" entry to mark that one commit as a new pseudo-root commit.

Regular (non-shallow) clone and fetch

These are all related to how Git behaves when you're using a normal (non-shallow, non-single-branch) clone: git fetch calls up the upstream Git, gets a list of everything, and then brings over whatever you don't already have. This is why an initial clone is so slow, and a fetch-to-update is usually so fast: once you get a full clone, the updates rarely have very much to bring over: maybe a few commits, maybe a few hundred, and most of those commits don't need much else either.

The history of a repository is formed from the commits. Each commit names its parent commit (or for merges, parent commits, plural), in a chain that goes backwards from "the latest commit", to the previous commit, to some more-ancestral commit, and so on. The chain eventually stops when it reaches a commit that has no parent, such as the first commit ever made in the repository. This kind of commit is a root commit.

That is, we can draw a graph of commits. In a really simple repository the graph is just a straight line, with all the arrows pointing backwards:

o <- o <- o <- o   <-- master

The name master points to the fourth and latest commit, which points back to the third, which points back to the second, which points back to the first.
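To see such a chain for real, here is a small sketch that builds a four-commit repository in a temporary directory and prints each commit next to its parent. The file name, commit messages, and demo identity are arbitrary choices for the example.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master

# Four commits in a straight line, each one changing the same file.
for n in 1 2 3 4; do
    echo "$n" > file.txt
    git add file.txt
    git commit --quiet -m "commit $n"
done

# %h is the commit's abbreviated ID, %p its parent ID(s); the root commit has none.
git log --format='%h  parent: %p  (%s)' master

commit_count=$(git rev-list --count master)
root_count=$(git rev-list --count --max-parents=0 master)
echo "commits=$commit_count roots=$root_count"
```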

Each commit carries with it a complete snapshot of all the files that go in that commit. Files that are not at all changed are shared across these commits: the fourth commit just "borrows" the unchanged version from the third commit, which "borrows" it from the second, and so on. Hence, each commit names all the "Git objects" that it needs, and Git either finds those objects locally—because it already has them—or uses the fetch protocol to bring them over from the other, upstream Git. There's a compression format called "packing", and a special variant for network transfer called "thin packs", that allows Git to do this even better / fancier, but the principle is simple: Git needs all, and only, those objects that go with the new commits it's picking up. Your Git decides whether it has those objects, and if not, obtains them from their Git.

A more-complicated, more-complete graph generally has several points where it branches, some where it merges, and multiple branch names pointing to different branch tips:

        o--o   <-- feature/tall
       /
o--o--o---o    <-- master
    \    /
     o--o      <-- bug/short

Here branch bug/short is merged back into master, while branch feature/tall is still undergoing development. The name bug/short can (probably) now be deleted entirely: we don't need it anymore if we are done making commits on it. The commit at the tip of master names two previous commits, including the commit at the tip of bug/short, so by fetching master we will fetch the bug/short commits.

Note that both the simple and slightly-more-complicated graph each have just one root commit. That's pretty typical: all repositories that have commits have at least one root commit, since the very first commit is always a root commit; but most repositories have only one root commit as well. You can, however, have different root commits, as with this graph:

 o--o
     \
o--o--o   <-- master

or this one:

 o--o     <-- orphan

o--o      <-- master

In fact, the one with just the one master was probably made by merging orphan into master, then deleting the name orphan.
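A hedged sketch of how both graphs can arise: create a second root with git checkout --orphan, count the root commits, then merge and delete the extra name. The branch and file names are made up for the example, and --allow-unrelated-histories assumes Git 2.9 or later.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master

echo a > a.txt
git add a.txt
git commit --quiet -m "master root"

# Start a branch with no history: its first commit will be a second root.
git checkout --quiet --orphan orphan
git rm --quiet -rf .
echo b > b.txt
git add b.txt
git commit --quiet -m "orphan root"

roots_all=$(git rev-list --count --max-parents=0 --all)
roots_master_before=$(git rev-list --count --max-parents=0 master)

# Merge the unrelated history into master, then drop the extra name.
git checkout --quiet master
git merge --quiet --allow-unrelated-histories -m "merge orphan" orphan
git branch --quiet -d orphan
roots_master_after=$(git rev-list --count --max-parents=0 master)
echo "all=$roots_all master before=$roots_master_before after=$roots_master_after"
```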

Grafts and replacements

Git has for a long time had (possibly shaky) support for grafts, which was replaced with (much better, actually-solid) support for generic replacements. To grasp them concretely we need to add, to the above, the notion that each commit has its own unique ID. These IDs are the big ugly 40-character SHA-1 hashes, face0ff... and so on. In fact, every Git object has a unique ID, though for graph purposes, all we care about are the commits.

For drawing graphs, those big hash IDs are too painful to use, so we can use one-letter names A through Z instead. Let's use this graph again but put in one-letter names:

        E--H   <-- feature/tall
       /
A--B--D---G    <-- master
    \    /
     C--F      <-- bug/short

Commit H refers back to commit E (E is H's parent). Commit G, which is a merge commit—meaning it has at least two parents—refers back to both D and F, and so on.

Note that the branch names, feature/tall, master, and bug/short, each point to one single commit. The name bug/short points to commit F. This is why commit F is on branch bug/short ... but so is commit C. Commit C is on bug/short because it is reachable from the name. The name gets us to F, and F gets us to C, so C is on branch bug/short.

Note, however, that commit G, the tip of master, gets us to commit F. This means that commit F is also on branch master. This is a key concept in Git: commits may be on one, many, or even no branches. A branch name is merely a way to get started within a commit graph. There are other ways, such as tag names, refs/stash (which gets you to the current stash: each stash is actually a couple of commits), and the reflogs (which are normally hidden from view as they are normally just clutter).
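The reachability idea can be checked directly. The sketch below rebuilds the master and bug/short part of the graph (commit messages reuse the one-letter names, and the file names are invented), then asks which branches contain commit F.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: append a line to a file and commit it under a one-letter name.
make_commit() {
    echo "$1" >> "$2"
    git add "$2"
    git commit --quiet -m "$1"
}

make_commit A main.txt
make_commit B main.txt
git branch bug/short            # bug/short starts at B
make_commit D main.txt
d_id=$(git rev-parse HEAD)
git checkout --quiet bug/short
make_commit C fix.txt
make_commit F fix.txt
f_id=$(git rev-parse HEAD)
git checkout --quiet master
git merge --quiet --no-ff -m G bug/short

# Lists every branch from whose tip F is reachable.
git branch --contains "$f_id"

if git merge-base --is-ancestor "$f_id" master; then f_on_master=yes; else f_on_master=no; fi
if git merge-base --is-ancestor "$d_id" bug/short; then d_on_bug=yes; else d_on_bug=no; fi
echo "F on master: $f_on_master, D on bug/short: $d_on_bug"
```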

This also, however, gets us to grafts and replacements. A graft is just a limited kind of replacement, and shallow repositories use a limited form of graft.1 I won't describe replacements fully here as they are a bit more complicated, but in general, what Git does for all of these is to use the graft or replacement as an "instead-of". For the specific case of commits, what we want here is to be able to change—or at least, pretend to change—the parent ID or IDs of any commit ... and for shallow repositories, we want to be able to pretend that the commit in question has no parents.


1The way shallow repositories use the graft code is not shaky. For the more general case, I recommended using git replace instead, as that also was and is not shaky. The only recommended use for grafts is—or at least was, years ago—to put them in place just long enough to run git filter-branch to copy an altered—grafted—history, after which you should just discard the grafted history entirely. You can use git replace for this purpose as well, but unlike grafts, you can use git replace permanently or semi-permanently, without needing git filter-branch.


Making a shallow clone

To make a depth-1 shallow clone of the current state of the upstream repository, we will pick one of the three branch names—feature/tall, master, or bug/short—and translate it to a commit ID. Then we will write a special graft entry that says: "When you see that commit, pretend that it has no parent commits, i.e., is a root commit."

Let's say we pick master. The name master points to commit G, so to make a shallow clone of commit G, we obtain commit G from the upstream Git as usual, but then write a special graft entry that claims commit G has no parents. We put that into our repository, and now our graph looks like this:

G   <-- master, origin/master

Those parent IDs are still actually inside G; it's just that every time we have Git use or show us the history, it immediately "grafts" nothing-at-all on, so that G seems to be a root commit, for history tracking purposes.
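A small sketch makes the difference between the stored parent and the visible history concrete. It assumes a throwaway two-commit upstream in a temporary directory and reads the graft entry straight from .git/shallow.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
for n in 1 2; do
    echo "$n" > "$tmp/upstream/file.txt"
    git -C "$tmp/upstream" add file.txt
    git -C "$tmp/upstream" commit --quiet -m "commit $n"
done

git clone --quiet --depth 1 "file://$tmp/upstream" "$tmp/clone"
cd "$tmp/clone"

head_id=$(git rev-parse HEAD)
# The raw commit object still records its real parent...
stored_parents=$(git cat-file -p HEAD | grep -c '^parent ')
# ...but history walking stops at the commit listed in .git/shallow.
visible_commits=$(git rev-list --count HEAD)
shallow_entry=$(cat .git/shallow)
echo "stored parents=$stored_parents visible commits=$visible_commits"
echo "HEAD=$head_id shallow entry=$shallow_entry"
```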

Updating a shallow clone we made earlier

But what if we already have a (depth-1 shallow) clone, and we want to update it? Well, that's not really a problem. Let's say we made a shallow clone of the upstream back when master pointed to commit B, before the new branches and the bug fix. That means we currently have this:

B   <-- master, origin/master

While B's real parent is A, we have a shallow-clone graft entry saying "pretend B is a root commit". Now we git fetch --depth 1, which looks up the upstream's master—the thing we call origin/master—and sees commit G. We grab commit G from the upstream, along with its objects, but deliberately don't grab commits D and F. We then update our shallow-clone graft entries to say "pretend G is a root commit too":

B   <-- master

G   <-- origin/master

Our repository now has two root commits: The name master (still) points to commit B, whose parents we (still) pretend are non-existent, and the name origin/master points to G, whose parents we pretend are non-existent.
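The state after the fetch, but before any reset, can be reproduced with a sketch like the following. The upstream here is a throwaway local repository, so the commits playing the roles of B and G are simply its first and second commits.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
echo one > "$tmp/upstream/file.txt"
git -C "$tmp/upstream" add file.txt
git -C "$tmp/upstream" commit --quiet -m "plays the role of B"
git clone --quiet --single-branch --depth 1 "file://$tmp/upstream" "$tmp/clone"

echo two > "$tmp/upstream/file.txt"
git -C "$tmp/upstream" commit --quiet -a -m "plays the role of G"

# Fetch only: origin/master moves, the local master does not.
git -C "$tmp/clone" fetch --quiet --depth 1

local_master=$(git -C "$tmp/clone" rev-parse master)
remote_master=$(git -C "$tmp/clone" rev-parse origin/master)
upstream_head=$(git -C "$tmp/upstream" rev-parse HEAD)
local_len=$(git -C "$tmp/clone" rev-list --count master)
remote_len=$(git -C "$tmp/clone" rev-list --count origin/master)
echo "master=$local_master ($local_len commit visible)"
echo "origin/master=$remote_master ($remote_len commit visible)"
```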

This is why you need git reset

In a normal repository, you might use git pull, which really is git fetch followed by git merge. But git merge requires history, and we have none: we have faked Git out with pretend root commits, and they have no history behind them. So we must use git reset instead.

What git reset does is a bit complicated, because it can affect up to three different things: a branch name, the index, and the work-tree. We have already seen what the branch names are: they simply point to a (one, specific) commit, which we call the tip of the branch. That leaves the index and work-tree.

The work-tree is easy to explain: it's where all your files are. That's it: no more and no less. It's there so that you can actually use Git: Git is all about storing every commit ever made, forever, so that they can all be retrieved. But they're in a format useless to mere mortals. To be used, a file—or more typically, a whole commit's worth of files—has to be extracted into its normal format. The work-tree is where that happens, and then you can work on it and make new commits using it too.

The index is a bit harder to explain. It's something peculiar to Git: other version control systems don't have one, or if they have something like it, they don't expose it. Git does. Git's index is essentially where you keep the next commit to make, but that means that it starts out holding the current commit that you have extracted into the work-tree, and Git uses that to make Git fast. We'll say more about this in a bit.

What git reset --hard does is to affect all three: branch name, index, and work-tree. It moves the branch name so that it points to a (probably different) commit. Then it updates the index to match that commit, and updates the work-tree to match the new index.

Hence:

git reset --hard origin/master

tells Git to look up origin/master. Since we ran our git fetch, that now points to commit G. Git then makes our master—our current (and only) branch—also point to commit G, and then updates our index and work-tree. Our graph now looks like this:

B   [abandoned - but see below]

G   <-- master, origin/master

Now master and origin/master both name commit G, and commit G is the one checked-out into the work-tree.

Why you need git clean -dfx

The answer here is a bit complicated, but usually it's "you don't" (need to git clean).

When you do need git clean, it is because you (or something you ran) added files to your work-tree that you have not told Git about. These are untracked and/or ignored files. Using git clean -df will remove untracked files and untracked directories; adding -x will also remove the ignored files.

For more about the difference between "untracked" and "ignored", see this answer.

Why you don't need git clean: the index

I mentioned above that you usually don't need to run git clean. This is because of the index. As I said earlier, Git's index is mainly "the next commit to make". If you never add your own files—if you are just using git checkout to check out various existing commits that you have had all along, or that you have added with git fetch; or if you are using git reset --hard to move a branch name and also switch the index and work-tree to another commit—then whatever is in the index right now is there because an earlier git checkout (or git reset) put it in the index, and also into the work-tree.

In other words, the index has a short—and fast for Git to access—summary or manifest describing the current work-tree. Git uses that to know what is in the work-tree now. When you ask Git to switch to another commit, via git checkout or git reset --hard, Git can quickly compare the existing index to the new commit. Any files that have changed, Git must extract from the new commit (and update the index). Any files that are newly added, Git must also extract (and update the index). Any files that are gone—that are in the existing index, but not in the new commit—Git must remove ... and that's what Git does. Git updates, adds, and removes those files in the work-tree, as directed by the comparison between the current index, and the new commit.

What this means is that if you do need git clean, you must have done something outside Git that added files. These added files are not in the index, so by definition, they are untracked and/or ignored. If they are merely untracked, git clean -f would remove them, but if they are ignored, only git clean -fx will remove them. (You want -d so that git clean also recurses into untracked directories and removes them; without it, such directories are left alone.)
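The difference between the two flag sets shows up in a quick experiment. This sketch assumes a scratch repository whose .gitignore ignores *.log; the file names are invented for the example.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/repo"
cd "$tmp/repo"
echo '*.log' > .gitignore
echo 'int main(void) { return 0; }' > main.c
git add .gitignore main.c
git commit --quiet -m "tracked files"

# Files created outside Git's knowledge:
echo scratch > notes.tmp         # untracked
echo noise > build.log           # ignored (matches *.log)
mkdir out
echo data > out/result.bin       # untracked directory

git clean -n -dfx                # dry run: only lists what would be removed

git clean --quiet -df            # removes untracked files and directories
if [ -e build.log ]; then ignored_after_df=present; else ignored_after_df=gone; fi
if [ -e notes.tmp ] || [ -e out ]; then untracked_after_df=present; else untracked_after_df=gone; fi

git clean --quiet -dfx           # additionally removes ignored files
if [ -e build.log ]; then ignored_after_dfx=present; else ignored_after_dfx=gone; fi
echo "after -df: ignored $ignored_after_df, untracked $untracked_after_df"
echo "after -dfx: ignored $ignored_after_dfx"
```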

Abandoned commits and garbage collection

I mentioned, and drew in the updated shallow graph, that when we git fetch --depth 1 and then git reset --hard, we wind up abandoning the previous depth-1 shallow graph commit. (In the graph I drew, this was commit B.) However, in Git, abandoned commits are rarely truly abandoned—at least, not right away. Instead, some special names like ORIG_HEAD hang on to them for a while, and each reference—branches and tags are forms of reference—carries with it a log of "previous values".

You can display each reflog with git reflog refname. For instance, git reflog master shows you not only which commit master names now, but also which commits it has named in the past. There is also a reflog for HEAD itself, which is what git reflog shows by default.

Reflog entries eventually expire. Their exact duration varies, but by default they are eligible for expiration after 30 days in some cases and 90 days in others. Once they do expire, those reflog entries no longer protect abandoned commits (or, for annotated tag references, the annotated tag object—tags are not supposed to move, so this case is not supposed to occur, but if it does—if you force Git to move a tag—it's just handled in the same way as all other references).

Once any Git object—commit, annotated tag, "tree", or "blob" (file)—is really unreferenced, Git is allowed to remove it for real.2 It's only at this point that the underlying repository data for the commits and files goes away. Even then, it only happens when something runs git gc. Thus, a shallow repository updated with git fetch --depth 1 is not quite the same as a fresh clone with --depth 1: the shallow repository probably has some lingering names for the original commits, and won't remove the extra repository objects until those names expire or are otherwise cleared-out.
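If the lingering objects matter, one possible approach (a sketch, not something the original answer prescribes) is to expire the reflogs immediately and run git gc with --prune=now. Whether the old commit really disappears may depend on the Git version and on other names still pointing at it, so the script below reports the result instead of assuming it.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
echo one > "$tmp/upstream/file.txt"
git -C "$tmp/upstream" add file.txt
git -C "$tmp/upstream" commit --quiet -m first
git clone --quiet --single-branch --depth 1 "file://$tmp/upstream" "$tmp/clone"
old_id=$(git -C "$tmp/clone" rev-parse HEAD)

echo two > "$tmp/upstream/file.txt"
git -C "$tmp/upstream" commit --quiet -a -m second
git -C "$tmp/clone" fetch --quiet --depth 1
git -C "$tmp/clone" reset --quiet --hard origin/master

# Hypothetical helper: report whether a commit object is still in the clone.
has_commit() {
    if git -C "$tmp/clone" cat-file -e "$1^{commit}" 2>/dev/null; then
        echo yes
    else
        echo no
    fi
}

old_before=$(has_commit "$old_id")
# Drop every reflog entry now, then collect unreferenced objects immediately.
git -C "$tmp/clone" reflog expire --expire=now --all
git -C "$tmp/clone" gc --quiet --prune=now
old_after=$(has_commit "$old_id")
visible_commits=$(git -C "$tmp/clone" rev-list --count HEAD)
echo "old commit present before cleanup: $old_before, after: $old_after"
```

Doing this discards the safety net that the reflogs provide, so it only makes sense for clones with no local work worth recovering.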


2Besides the reference check, objects get a minimum time before they expire as well. The default is two weeks. This prevents git gc from deleting temporary objects that Git is creating, but has yet to establish a reference to. For instance, when making a new commit, Git first turns the index into a series of tree objects which refer to each other but have no top-level reference. Then it creates a new commit object that refers to the top-level tree, but nothing yet refers to the commit. Last, it updates the current branch name. Until that last step finishes, the trees and new commit are unreachable!


Special considerations for --single-branch and/or shallow clones

I noted above that the name you give to git clone -b can refer to a tag. For normal (non-shallow or non-single-branch) clones, this works just as one would expect: you get a regular clone, and then Git does a git checkout by the tag name. The result is the usual detached HEAD, in a perfectly ordinary clone.

With shallow or single-branch clones, however, there are several unusual consequences. These are all, to some extent, a result of Git letting the implementation show through.

First, if you use --single-branch, Git alters the normal fetch configuration in the new repository. The normal fetch configuration depends on the name you choose for the remote, but the default is origin so I will just use origin here. It reads:

fetch = +refs/heads/*:refs/remotes/origin/*

Again, this is the normal configuration for a normal (not single-branch) clone. This configuration tells git fetch what to fetch, which is "all branches". When you use --single-branch, though, you get instead a fetch line that refers to only the one branch:

fetch = +refs/heads/zorg:refs/remotes/origin/zorg

if you're cloning the zorg branch.

Whichever branch you clone, that's the one that goes into the fetch line. Each future git fetch will obey this line,3 so you won't fetch any other branches. If you do want to fetch other branches later, you will have to alter this line, or add more lines.
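The fetch line can be inspected, and widened, from the command line. The following sketch uses a throwaway upstream whose only branch is master; git remote set-branches with '*' is one way to put the usual wildcard refspec back, and it does not by itself deepen the clone.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
tmp=$(mktemp -d)
git init --quiet "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
echo one > "$tmp/upstream/file.txt"
git -C "$tmp/upstream" add file.txt
git -C "$tmp/upstream" commit --quiet -m first
git clone --quiet --single-branch --depth 1 "file://$tmp/upstream" "$tmp/clone"

# The single-branch clone records a refspec for just the one branch.
before=$(git -C "$tmp/clone" config --get-all remote.origin.fetch)

# Replace it with the wildcard refspec so later fetches see every branch.
git -C "$tmp/clone" remote set-branches origin '*'
after=$(git -C "$tmp/clone" config --get-all remote.origin.fetch)

echo "before: $before"
echo "after:  $after"
```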

Second, if you use --single-branch and what you clone is a tag, Git will put in a rather odd fetch line. For instance, with git clone --single-branch -b v2.1 ... I get:

fetch = +refs/tags/v2.1:refs/tags/v2.1

This means you will get no branches, and unless someone has moved the tag,4 git fetch will do nothing!

Third, the default tag behavior is a bit weird due to the way git clone and git fetch obtain tags. Remember that tags are simply a reference to one particular commit, just like branches and all other references. There are two key differences between branches and tags, though: branches are expected to move (and tags are not), and branches get renamed (and tags don't).

Remember that all throughout the above, we keep finding that the other (upstream) Git's master becomes our origin/master, and so on. This is an example of the renaming process. We also saw, briefly, precisely how that renaming works, through the fetch = line: our Git takes their refs/heads/master and changes it to our refs/remotes/origin/master. This name is not only different-looking (origin/master), but literally can't be the same as any of our branches. If we create a branch named origin/master,5 this branch's "full name" is actually refs/heads/origin/master which is different from the other full name refs/remotes/origin/master. It's only when Git uses the shorter name that we have one (regular, local) branch named origin/master and another different (remote-tracking) branch named origin/master. (It's a lot like being at a group where everyone is named Bruce.)

Tags don't go through all this. The tag v2.1 is just named refs/tags/v2.1. This means there's no way to separate "their" tag from "your" tag. You can have either your tag, or their tag. As long as no one ever moves a tag, this doesn't matter: if you both have the tag, it must point to the same object. (If someone starts moving tags, things get ugly.)

In any case, Git implements the "normal" fetching of tags by a simple rule:6 when Git already has a commit, if some tag names that commit, Git copies the tag too. With ordinary clones, the first clone gets all the tags, and then subsequent git fetch operations get the new tags. A shallow clone, however, by definition omits some commit(s), namely everything below any graft-point in the graph. Those commits won't pick up the tags. They can't: to have the tags, you would need to have the commits. Git is not allowed (except through the shallow grafts) to have the ID of a commit without actually having the commit.


3The configured fetch line applies only to a default fetch: you can give git fetch some refspec(s) on the command line, and those will override it. You may also use multiple fetch = lines in the configuration, e.g., to fetch just a specific set of branches, although the normal way to "de-restrict" an initially-single-branch clone is to put back the usual +refs/heads/*:refs/remotes/origin/* fetch line.

4Since tags are not supposed to move, we could just say "this does nothing". If they do move, though, the + in the refspec represents the force flag, so the tag winds up moving.

5Don't do this. It's confusing. Git will handle it just fine—the local branch is in the local name space, and the remote-tracking branch is in the remote-tracking name space—but it's really confusing.

6This rule does not match the documentation. I tested against Git version 2.10.1; older Gits might use a different method. Git since 2.26 may also use different rules now that there is a newer, fancier protocol for git fetch and git push to use. If you care about the precise behavior with tags, you may need to test it on your particular Git version.


On the shallow clone update process itself, see commit 649b0c3 from Git 2.12 (Q1 2017).
That commit is part of:

Commit 649b0c3, commit f2386c6, commit 6bc3d8c, commit 0afd307 (06 Dec 2016) by Nguyễn Thái Ngọc Duy (pclouds). See commit 1127b3c, commit 381aa8e (06 Dec 2016) by Rasmus Villemoes (ravi-prevas). (Merged by Junio C Hamano -- gitster -- in commit 3c9979b, 21 Dec 2016)

shallow.c

This paint_down() is part of step 6 of 58babff (shallow.c: the 8 steps to select new commits for .git/shallow - 2013-12-05).
When we fetch from a shallow repository, we need to know if one of the new/updated refs needs new "shallow commits" in .git/shallow (because we don't have enough history of those refs) and which one.

The question at step 6 is, what (new) shallow commits are required in order to maintain reachability throughout the repository without cutting our history short?
To answer, we mark all commits reachable from existing refs with UNINTERESTING ("rev-list --not --all"), mark shallow commits with BOTTOM, then for each new/updated refs, walk through the commit graph until we either hit UNINTERESTING or BOTTOM, marking the ref on the commit as we walk.

After all the walking is done, we check the new shallow commits. If we have not seen any new ref marked on a new shallow commit, we know all new/updated refs are reachable using just our history and .git/shallow.
The shallow commit in question is not needed and can be thrown away.

So, the code.

The loop here (to walk through commits) is basically:

  1. get one commit from the queue
  2. ignore if it's SEEN or UNINTERESTING
  3. mark it
  4. go through all the parents and..
  5a. mark it if it's never marked before
  5b. put it back in the queue

What we do in this patch is drop step 5a because it is not necessary.
The commit being marked at 5a is put back on the queue, and will be marked at step 3 at the next iteration. The only case it will not be marked is when the commit is already marked UNINTERESTING (5a does not check this), which will be ignored at step 2.