How to unstage a file during interactive rebase (remove file from old commit)?

Solution 1:

When you talk about removing a file during interactive rebase, you might mean one of two things:

  • Making the file match the previous commit. Some would call this removing the changes to the file, and since some people think of commits as changes, some people will then shorten this to removing the file.

  • Literally removing the file, so that your new-and-improved commit omits the file.

Both are relatively easy to do. Before I say how to do it, I'll put in some background.

Background

To understand what you're doing and why, it helps to have the right mental model of Git commits:

  • Each commit has a unique number (hash ID or OID, where OID stands for Object ID). The commit's number is how Git really finds the commit: branch names don't actually matter.

  • All Git commits—in fact, all Git internal objects—are completely read-only. You can't change a commit, so that's not what git rebase really does.

  • Each commit stores two things: (a) a snapshot of all files, and (b) some metadata.

The files in a commit snapshot are compressed and Git-ified and, importantly, de-duplicated (both within and across commits) when the contents of some file match the contents of some other file. So the fact that most commits mostly contain all the files from most previous commits doesn't cause the repository to bloat: these duplicated files are stored only once. Git handles this all invisibly, and pretty well, as long as you don't store large, incompressible binary files in Git (if you do, then Git handles this poorly and the repository bloats up and becomes unusable). But this also means that commits don't store changes.
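The de-duplication is easy to observe. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file names keep.txt and change.txt and the Demo identity are made up for illustration.

```shell
# Sketch: an unchanged file is stored once and shared by both snapshots.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo "same content" > keep.txt
echo "v1" > change.txt
git add . && git commit -q -m "first"

echo "v2" > change.txt                    # only this file changes
git commit -q -am "second"

# Blob (file content) IDs as recorded in each commit's snapshot
keep_old=$(git rev-parse HEAD~1:keep.txt)
keep_new=$(git rev-parse HEAD:keep.txt)
change_old=$(git rev-parse HEAD~1:change.txt)
change_new=$(git rev-parse HEAD:change.txt)

echo "keep.txt:   $keep_old -> $keep_new"
echo "change.txt: $change_old -> $change_new"
```

Both commits list keep.txt, but they refer to the same stored object; only change.txt has two distinct objects.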

The metadata in a commit records stuff like who made the commit, when, and why (their log message), but it also records data that's crucial for Git's internal operation: each commit stores a list of parent commit hash IDs. This list is usually exactly one element long, giving each commit a single parent. The parent commit has a snapshot too, and to turn a commit into changes (for viewing purposes), Git extracts both snapshots and sees which files are changed. Because of the de-duplication, Git can short-circuit this and not bother extracting the identical files at all; it then only has to come up with a change-recipe for files that don't match in the two commits. That's what you see with git show or git log -p: a diff from the commit's (single) parent's snapshot to the commit's snapshot.
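To see the metadata for yourself, git cat-file -p prints a raw commit object. This is a small sketch in a scratch repository (the file name a.txt and the Demo identity are placeholders), showing that the recorded parent is the previous commit and that the "changes" are computed by comparing snapshots.

```shell
# Sketch: inspect a commit's metadata and derive its changes on demand.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo one > a.txt && git add a.txt && git commit -q -m "first"
echo two > a.txt && git commit -q -am "second"

# Raw commit object: tree (the snapshot), parent, author, committer, message
git cat-file -p HEAD

recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
actual_parent=$(git rev-parse HEAD~1)

# Git compares the two snapshots to find which files differ
changed=$(git diff --name-only HEAD~1 HEAD)
echo "parent recorded in HEAD: $recorded_parent"
echo "files that differ:       $changed"
```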

Because commits are read-only, and only Git itself can read them, we don't actually work on or with the commits. Instead, when we pick some commit to use, we have Git extract the commit, like un-tar-ing or un-rar-ing some archive, into a working area, which Git calls our working tree or work-tree. Although these work-tree files are extracted from Git, they're not actually in Git at all while you get your work done.

Because commits are read-only, git rebase can't fix any bad commit, nor does git commit --amend change a commit. Instead, Git makes use of the fact that humans, unlike Git itself, never[1] find a commit by hash ID: we use branch names. A branch name simply holds the hash ID of the last commit we want to claim to be "part of the branch". That commit then holds, in its metadata, the hash ID of the previous commit, which holds in its metadata the hash ID of another even-earlier commit, and so on. This produces a simple backwards-looking chain:

... <-F <-G <-H   <--branch

where the branch name holds the hash ID of the last commit H in the chain, and everything works backwards from there.

When we add commits in the normal everyday fashion, Git makes a new (read-only) commit I whose parent is H, adds that to the chain, and writes the hash ID of new commit I into the branch name:

...--G--H--I   <-- branch (HEAD)
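A quick way to convince yourself that a branch name is just a movable pointer: the sketch below (scratch repository, placeholder file name, branch forced to be called main) records the hash the name holds before and after a commit.

```shell
# Sketch: committing writes a new commit and moves the branch name to it.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main     # call the branch "main" whatever the default is
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo h > file.txt && git add file.txt && git commit -q -m "H"
tip_before=$(git rev-parse main)          # the name holds H's hash

echo i >> file.txt && git commit -q -am "I"
tip_after=$(git rev-parse main)           # the name now holds I's hash
parent_of_new=$(git rev-parse main^)      # I's recorded parent

echo "main before: $tip_before"
echo "main after:  $tip_after (parent $parent_of_new)"
```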

To "amend" commit H, Git simply writes new commit I with commit G as its parent, instead of commit H, resulting in:

       H
      /
...--G--I   <-- branch (HEAD)

Commit H still exists, but unless we've memorized its hash ID, we'll never see it again. (Git can, as long as Git can find its hash ID.)
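Here is a hedged sketch of that in a scratch repository (file name and identity are placeholders). It saves the old hash before amending, which is the "memorized hash ID" case, and then checks that the old commit is still there and that old and new share a parent.

```shell
# Sketch: --amend makes a sibling commit; the original is not modified.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo one > a.txt && git add a.txt && git commit -q -m "G"
echo two > a.txt && git commit -q -am "H"
old=$(git rev-parse HEAD)                 # remember H's hash

echo three > a.txt && git add a.txt
git commit -q --amend --no-edit           # writes a new commit with G as parent
new=$(git rev-parse HEAD)

old_parent=$(git rev-parse "$old^")
new_parent=$(git rev-parse "$new^")
old_type=$(git cat-file -t "$old")        # the old object is still in the database

echo "old $old and new $new share parent $new_parent"
echo "old object type: $old_type"
```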

Rebasing simply consists of making more than one new-and-improved commit. If we have:

...--F--G--H   <-- main
         \
          I--J   <-- feature (HEAD)

and we want a revised I to appear after H instead of before / in-parallel-with it, we make a new snapshot-and-metadata commit I' (that otherwise looks like I, except that we pick up H's snapshot as our "base" and "re-add" our changes from I, hence "re-base"-ing I):

             I'  <-- HEAD [detached]
            /
...--F--G--H   <-- main
         \
          I--J   <-- feature

We then repeat this for commit J to get J':

             I'-J'  <-- HEAD [detached]
            /
...--F--G--H   <-- main
         \
          I--J   <-- feature

Once we've copied all the commits to new-and-improved ones, we have Git move the name feature to point to the last copied commit:

             I'-J'  <-- feature (HEAD)
            /
...--F--G--H   <-- main
         \
          I--J   [abandoned]

The original commits still exist; we just can't find them.
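The same sequence can be reproduced in a scratch repository. This is only a sketch: the file names and the Demo identity are invented, and the commits are named after the letters in the diagrams.

```shell
# Sketch: rebase copies I and J onto H, then moves the name "feature".
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo base > base.txt && git add . && git commit -q -m "G"
git checkout -q -b feature
echo i > i.txt && git add . && git commit -q -m "I"
echo j > j.txt && git add . && git commit -q -m "J"
git checkout -q main
echo h > h.txt && git add . && git commit -q -m "H"

old_tip=$(git rev-parse feature)          # original J
git checkout -q feature
git rebase -q main                        # copy I and J to I' and J'
new_tip=$(git rev-parse feature)          # J'

base_of_copies=$(git rev-parse feature~2) # should be H
main_tip=$(git rev-parse main)
old_type=$(git cat-file -t "$old_tip")    # original J still exists

git log --oneline --graph main feature "$old_tip"
```

Passing the saved hash to git log is what makes the abandoned commits visible again in the graph.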


[1] (Insert Gilbert & Sullivan HMS Pinafore routine here.)


Interactive rebase

Interactive rebase uses the same process as non-interactive rebase,[2] but lets us stop and make adjustments. To do that, Git provides us with an instruction sheet. It contains, initially, a series of pick commands for each commit that we will copy. These instruct Git to run git cherry-pick, which is the step that copies a commit, like I to I' above.

Changing pick to edit makes Git do the cherry-pick, but then stop in the detached-HEAD mode. Note that here we're copying I to I' and placing it in the same physical position as before, rather than moving it to come after commit H:

          I'  <-- HEAD [detached]
         /
...--G--H   <-- main
         \
          I--J   <-- feature

Now that we're in this state, we can use git commit --amend to make yet another commit, I". In this commit, we can store any snapshot we like, and use any commit message we like. The parent of I" will be H, the same as the parent of I and I'.

The snapshot that goes into the new commit has the same source as any new Git commit: it comes from Git's index AKA staging area. This currently contains all the files from commit I', which match all the files from commit I (and hence take no extra space, since they are all already de-duplicated). These are the Git-ified copies of the files that are also in your work-tree. So you can modify or remove the file in your work-tree and run git add:

vim foo.py
git add foo.py

or:

rm foo.py
git add foo.py

The git add step tells Git to make the index copy match the working tree copy, by reading, compressing, and de-duplicating the file or, after removing foo.py, by removing the index copy entirely. Or:

git rm foo.py

combines the rm and git add into a single step. Either way we've arranged for the correct (updated or removed) file to be in Git's index, so we now run git commit --amend, just as you did:

git commit --amend

This shoves commit I' up out of the way, leaving commit I" pointing to H:

         I'  [abandoned]
        /
        | I"  <-- HEAD [detached]
        |/
...--G--H   <-- main
         \
          I--J   <-- feature

Running git rebase --continue tells the rebase code to proceed on to the next instruction in the instruction sheet: another pick, or edit, or reword, or whatever. Once the last instruction has been followed, rebase will yank the branch name around as before:

         I'  [abandoned]
        /
        | I"-J'  <-- feature (HEAD)
        |/
...--G--H   <-- main
         \
          I--J   [abandoned]
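For completeness, here is the whole edit / remove / amend / continue cycle as a non-interactive sketch. It assumes a scratch repository with invented file names, and it uses GIT_SEQUENCE_EDITOR with sed to change the first pick to edit instead of opening an editor; in real use you would edit the instruction sheet by hand.

```shell
# Sketch: drop foo.py from an older commit using an "edit" stop.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo base > base.txt && git add . && git commit -q -m "H"
git checkout -q -b feature
echo keep > keep.txt && echo oops > foo.py && git add . && git commit -q -m "I"
echo j > j.txt && git add . && git commit -q -m "J"

# Rewrite the instruction sheet: first "pick" becomes "edit"
GIT_SEQUENCE_EDITOR="sed -i.bak -e '1s/^pick/edit/'" git rebase -i main

# Rebase has stopped on the copy of I, with a detached HEAD
git rm -q -- foo.py                       # remove from index and working tree
git commit -q --amend --no-edit           # make I" without foo.py
GIT_EDITOR=true git rebase --continue     # copy J, then move "feature"

files=$(git ls-tree -r --name-only feature)
count=$(git rev-list --count main..feature)
branch=$(git symbolic-ref --short HEAD)
git log --oneline --name-status main..feature
```

The final log should list keep.txt under I and j.txt under J, with no trace of foo.py on the rewritten branch.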

(The abandoned commits sit around for a while—at least 30 days by default, in the usual setup—and then Git eventually notices that they've gone unused for long enough, drops the reflog entries that are keeping them alive, and purges them for real. Until then, though, you can easily get the originals back. Note that the special name ORIG_HEAD also remembers commit J for a while, until you do something else that has Git overwrite ORIG_HEAD with another hash ID. Right after a successful rebase, if you don't like the result, ORIG_HEAD works just as well as the reflog entry in branch@{1}.)
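As a sketch of that recovery path (scratch repository, invented file names), the following runs a plain rebase and then puts the branch back using ORIG_HEAD, checking along the way that ORIG_HEAD and the reflog entry agree.

```shell
# Sketch: undo a just-finished rebase via ORIG_HEAD / the branch reflog.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo base > base.txt && git add . && git commit -q -m "G"
git checkout -q -b feature
echo i > i.txt && git add . && git commit -q -m "I"
echo j > j.txt && git add . && git commit -q -m "J"
git checkout -q main
echo h > h.txt && git add . && git commit -q -m "H"
git checkout -q feature

old_tip=$(git rev-parse feature)          # original J
git rebase -q main

orig=$(git rev-parse ORIG_HEAD)           # set by rebase to the old tip
prev=$(git rev-parse 'feature@{1}')       # previous value of the branch name

git reset -q --hard ORIG_HEAD             # put "feature" back on the originals
restored=$(git rev-parse feature)
echo "restored feature to $restored"
```

Note that the git reset here overwrites ORIG_HEAD itself, which is one example of "something else" replacing its value.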


[2] In older versions of Git, there were numerous technical differences. In modern Git, these are largely gone now, though you can still invoke them on purpose if you really want to. I will also elide a number of optimizations Git normally uses for the kind of interactive rebase you'll be doing, that do make things better for Git but don't change the final outcome here.


We can now see what git reset or git restore will do

From the question: "Why neither git restore --staged file1 nor git reset HEAD file1 works?"

Both git reset and git restore will read a file's content from somewhere and write that file's content to somewhere. The git reset command itself is absurdly complicated, so it's better to stick with the newer, better-focused (more limited) git restore in my opinion, but either one will work: we just have to know several things here.

git reset HEAD^ -- F2   # reset F2 to previous version in staging area

Here, we're using git reset, not git restore, in its restore-one-file mode of operation. If we use:

git reset HEAD -- file1

we are telling Git: read the Git-ified copy of file1 from the commit specified by HEAD. If we use:

git reset HEAD^ -- F2

we are telling Git: read the Git-ified copy of F2 from the commit specified by HEAD^.

In both cases, having read the specified file from the specified commit, git reset writes the (Git-ified, pre-de-duplicated) content into the index / staging-area, ready to go into the new commit. The name of the file in the staging area is the same as the name of the file in the chosen commit (file1 or F2). The working tree copy of the file is not changed here! This is undesirable, since it makes it hard to see what you're doing, but since Git isn't actually using the working tree copy at this point, it's not exactly harmful right now either.
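A small sketch makes the index-only behavior concrete. It assumes a scratch repository where F2 holds v1 in the parent commit and v2 in the current one; the contents and identity are placeholders.

```shell
# Sketch: "git reset <commit> -- <file>" updates the index, not the working tree.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo v1 > F2 && git add F2 && git commit -q -m "H"
echo v2 > F2 && git commit -q -am "I"

git reset -q HEAD^ -- F2                  # copy H's F2 into the index only

index_blob=$(git rev-parse :F2)           # what the next commit would contain
parent_blob=$(git rev-parse HEAD^:F2)     # F2 as stored in H
worktree=$(cat F2)                        # what we see in an editor

echo "index matches parent: $index_blob = $parent_blob"
echo "working tree still says: $worktree"
```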

Using git checkout is better:

git checkout HEAD^ -- F2

This form of git checkout—which, like git reset, is absurdly complicated, which in turn is why git checkout was split into git switch and git restore in Git 2.23—reads a file from the specified commit and writes it to both Git's index and your working tree. This makes it much easier to see what you have done, since the working tree copy is now obvious.
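The same scratch setup (placeholder contents and identity again) shows the difference: after this form of git checkout, both the index and the working tree hold the parent's version.

```shell
# Sketch: "git checkout <commit> -- <file>" updates index and working tree.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo v1 > F2 && git add F2 && git commit -q -m "H"
echo v2 > F2 && git commit -q -am "I"

git checkout HEAD^ -- F2                  # copy H's F2 to index and working tree

index_blob=$(git rev-parse :F2)
parent_blob=$(git rev-parse HEAD^:F2)
worktree=$(cat F2)

echo "index matches parent: $index_blob = $parent_blob"
echo "working tree now says: $worktree"
```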

If your goal is to make the copy of F2 in the new I" commit match the copy in commit H, these HEAD^ forms of the command will do the trick. The reason is that HEAD currently names commit I', the copy of commit I. The parent of I' is H, so retrieving the copy of file F2 (or file1) from commit H will set the index version (and, with git checkout, the working tree version) to match that in H. The commit you then make with git commit --amend (the I" commit) has in it the same, de-duplicated copy of that file.

If your goal is to truly remove F2 entirely, so that commit I" has no file F2 at all, git rm F2 (or git rm -- F2 to avoid problems with files named --cached, for instance) will do that.

If we want to make F2 match the copy in H, but using git restore to avoid the overly-complicated-checkout-command-related errors, we'd run:

git restore -SW --source=HEAD^ -- F2

for instance. This does the same as the git checkout: we specify HEAD^ as the source for the file, -S (--staged) to tell git restore to write the file to the staging area, and -W (--worktree) to tell git restore to write the file to our working tree.
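Putting git restore together with the amend step, a sketch might look like the following. It needs Git 2.23 or later for git restore, and it runs in a scratch repository outside of any rebase, with other.txt added purely so that the amended commit still changes something.

```shell
# Sketch: revert one file to the parent's version, then amend.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo v1 > F2 && echo base > other.txt && git add . && git commit -q -m "H"
echo v2 > F2 && echo more >> other.txt && git commit -q -am "I"

git restore -SW --source=HEAD^ -- F2      # H's F2 into index and working tree
git commit -q --amend --no-edit           # new commit built from the index

amended_blob=$(git rev-parse HEAD:F2)
parent_blob=$(git rev-parse HEAD^:F2)
still_changed=$(git diff --name-only HEAD^ HEAD)

echo "F2 in amended commit matches parent: $amended_blob"
echo "files the amended commit still changes: $still_changed"
```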

Note that in all cases, our goal here is to make the index contain the correct files as git commit --amend is going to make the new snapshot from Git's index. Being humans, we should generally update the working tree copy of these files at the same time, since we can't see the index (staging area) copy, but we can see, in whatever editor or file viewer we prefer, the working tree copy.

We must also remember that if and when we run git status, Git will run two git diff --name-status operations for us:

  • One will compare the HEAD commit vs Git's index. But the HEAD commit is commit I', not commit H! So we should not pay too much attention to this diff.
  • The other will compare Git's index vs our working tree. This diff should, ideally, be empty, so that we're looking at the same files in our working tree as Git will be using in our next commit.
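You can run the two comparisons by hand. In this sketch (scratch repository, placeholder contents), the index-only git reset from earlier leaves both diffs non-empty, and copying the index version to the working tree empties the second one.

```shell
# Sketch: the two comparisons that git status summarizes.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo v1 > F2 && git add F2 && git commit -q -m "H"
echo v2 > F2 && git commit -q -am "I"
git reset -q HEAD^ -- F2                  # index gets H's F2; working tree keeps v2

staged=$(git diff --cached --name-status) # HEAD commit vs index
unstaged=$(git diff --name-status)        # index vs working tree

git restore --worktree -- F2              # copy the index version to the working tree
unstaged_after=$(git diff --name-status)  # now empty

echo "HEAD vs index:         $staged"
echo "index vs working tree: $unstaged"
echo "after syncing:         [$unstaged_after]"
```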

The reset --soft option

There's one other thing we can do, which I myself never actually do: instead of git commit --amend, we can start out the whole amending process with git reset --soft. That is, we start git rebase -i, change one pick to edit, write out the instruction sheet, and let rebase begin. We're now in this state:

          I'  <-- HEAD [detached]
         /
...--G--H   <-- main
         \
          I--J   <-- feature

The git reset --soft command lets us move the detached HEAD without changing either Git's index or our working tree at all. Running git reset --soft HEAD^ produces this:

          I'  [abandoned]
         /
...--G--H   <-- main, HEAD [detached]
         \
          I--J   <-- feature

That is, we give up commit I' right away. We have most of what we want in Git's index / staging-area and in our working tree. By giving up I' entirely, we now have things arranged as if we'd never made commit I at all: the current commit is now H, not I.

We can now git restore -SW --source HEAD -- file1, if that's what we want. In fact, with -S, --source HEAD is the default, so we can shorten this to:

git restore -SW -- file1

which will copy the committed file1 from commit H—our now-adjusted HEAD—to both Git's index and our working tree, discarding any changes we had made in commit I. Now git status and git diff --cached give us the same results we'd get if we were doing this commit for the first time.
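A sketch of this variant follows. It emulates the state at the edit stop in a plain scratch repository rather than inside a real rebase, and the file names added.txt and file1 contents are placeholders; inside an actual rebase you would finish with git rebase --continue after committing.

```shell
# Sketch: step back with reset --soft, unstage one file's change, recommit.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name Demo                 # hypothetical identity for the demo
git config user.email demo@example.com

echo v1 > file1 && git add . && git commit -q -m "H"
echo v2 > file1 && echo new > added.txt && git add . && git commit -q -m "I"

git reset -q --soft HEAD^                 # HEAD is now H; index and files untouched
staged_before=$(git diff --cached --name-status)

git restore -SW -- file1                  # source defaults to HEAD, i.e. H
staged_after=$(git diff --cached --name-status)

git commit -q -m "I, without the file1 change"
new_blob=$(git rev-parse HEAD:file1)
parent_blob=$(git rev-parse HEAD^:file1)

echo "staged before restore:"; echo "$staged_before"
echo "staged after restore:";  echo "$staged_after"
```

Since this path uses a plain git commit rather than --amend, the log message has to be supplied again.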

(It might be nice if the edit mode of rebase -i had always done this automatically, but it doesn't, and it's now far too late to change it.)