Why would I want stage before committing in Git?
When you commit, Git only commits the changes in the index (the "staged" files). There are many uses for this, but the most obvious is to break up your working changes into smaller, self-contained pieces. Perhaps you fixed a bug while you were implementing a feature. You can git add just that file (or git add -p to add just part of a file!) and then commit that bugfix before committing everything else. If you are using git commit -a then you are just forcing an add of every modified tracked file right before the commit. Don't use -a if you want to take advantage of staging files.
You can also treat the staged files as an intermediate working copy by passing the --cached option to many commands. For example, git diff --cached will show you how the stage differs from HEAD so you can see what you're about to commit without mixing in your other working changes.
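Here is a minimal sketch of that bugfix-while-implementing-a-feature workflow. It assumes a throwaway repository created with mktemp; the file names (feature.txt, bugfix.txt) and the user identity are made up for illustration.

```shell
set -e
# Throwaway repository; file names and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'feature v1\n' > feature.txt
printf 'bugfix v1\n' > bugfix.txt
git add feature.txt bugfix.txt
git commit -q -m "Initial commit"

# Edit both files, but stage only the bug fix.
printf 'feature v2\n' > feature.txt
printf 'bugfix v2\n' > bugfix.txt
git add bugfix.txt

staged=$(git diff --cached --name-only)   # index vs HEAD: what will be committed
unstaged=$(git diff --name-only)          # work-tree vs index: what stays behind
echo "staged:   $staged"
echo "unstaged: $unstaged"

# Commit only the staged bug fix; the feature edit is still uncommitted.
git commit -q -m "Fix bug"
remaining=$(git diff --name-only)
echo "still uncommitted: $remaining"
```

Had the last commit used -a instead, both files would have gone into the same commit.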
- The staging area gives you the control to make commits smaller. Make one logical change in the code and add the changed files to the staging area. If the changes turn out to be bad, go back to the previous commit; otherwise, commit them. This gives you the flexibility to split a task into smaller tasks and commit smaller changes. With the staging area it is easier to focus on small tasks.
- It also lets you take a break without having to remember how much work you had done before the break. Suppose you need to change three files to make one logical change, and you have changed the first file and need a long break before making the other changes. At this point you cannot commit, and you want to track which files you are done with so that when you come back you do not need to remember how much work has been done. So add the file to the staging area and it will save your work. When you come back, just run git diff --staged to check which files you changed and where, then start making the other changes.
One practical purpose of staging is logical separation of file commits.
As staging allows you to continue making edits to the files/working directory, and make commits in parts when you think things are ready, you can use separate stages for logically unrelated edits.
Suppose you have 4 files: fileA.html, fileB.html, fileC.html and fileD.html. You make changes to all 4 files and are ready to commit, but the changes in fileA.html and fileB.html are logically related (for example, the same new feature implemented in both files) while the changes in fileC.html and fileD.html are separate and logically unrelated to the previous two files. You can first stage fileA.html and fileB.html and commit those.
git add fileA.html
git add fileB.html
git commit -m "Implemented new feature XYZ"
Then, in the next step, you stage and commit the changes to the remaining two files.
git add fileC.html
git add fileD.html
git commit -m "Implemented another feature EFG"
To expand on Ben Jackson's answer, which is fine, let's look at the original question closely. (See his answer for "why bother" type questions; this is more about what is going on.)
I'm new to version control and I understand that "committing" is essentially creating a backup while updating the new 'current' version of what you're working on.
This isn't quite right. Backups and version control are certainly related (exactly how strongly depends on some things that are to some extent matters of opinion), but there are certainly some differences, if only in intent: Backups are typically designed for disaster recovery (machine fails, fire destroys entire building including all storage media, etc.). Version control is typically designed for finer-grained interactions and offers features that backups don't. Backups are typically stored for some time, then jettisoned as "too old": a fresher backup is all that matters. Version control normally saves every committed version forever.
What I don't understand is what staging for is from a practical perspective. Is staging something that exists in name only or does it serve a purpose? When you commit, its going to commit everything anyway, right?
Yes and no. Git's design here is somewhat peculiar. There exist version control systems that don't require a separate staging step. For instance, Mercurial, which is otherwise a lot like Git in terms of usage, doesn't require a separate hg add step, beyond the very first one that introduces an all-new file. With Mercurial, you use the hg command that selects some commit, then you do your work, then you run hg commit, and you're done. With Git, you use git checkout,[1] then you do your work, then you run git add, and then git commit. Why the extra git add step?
The secret here is what Git calls, variously, the index, or the staging area, or sometimes—rarely these days—the cache. These are all names for the same thing.
Edit: I think I may be confusing the terminology. Is a 'staged' file the same thing as a 'tracked' file?
No, but these are related. A tracked file is one that exists in Git's index. To properly understand the index, it's good to start with understanding commits.
[1] Since Git version 2.23, you can use git switch instead of git checkout. For this particular case, these two commands do exactly the same thing. The new command exists because git checkout got over-stuffed with too many things; they got split out into two separate commands, git switch and git restore, to make it easier and safer to use Git.
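As a rough illustration of that split, the sketch below uses the newer commands next to their older git checkout spellings. It assumes Git 2.23 or later and a throwaway repository; the branch name topic and the file notes.txt are made up.

```shell
set -e
# Throwaway repository; names and identity are hypothetical. Needs Git >= 2.23.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'original\n' > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
start=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git switch -q -c topic      # same effect here as: git checkout -b topic
git checkout -q "$start"    # the older spelling still works
git switch -q topic         # same effect here as: git checkout topic

printf 'scratch edit\n' > notes.txt
git restore notes.txt       # same effect here as: git checkout -- notes.txt

branch=$(git symbolic-ref --short HEAD)
content=$(cat notes.txt)
echo "on branch $branch, notes.txt contains: $content"
```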
Commits
In Git, a commit saves a full snapshot of every file that Git knows about. (Which files does Git know about? We'll see that in the next section.) These snapshots are stored in a special, read-only, Git-only, compressed and de-duplicated form, that in general only Git itself can read. (There's more stuff in each commit than just this snapshot, but that's all we will cover here.)
The de-duplication helps with space: we normally only change a few files, then make a new commit. So most of the files in a commit are mostly the same as the files in the previous commit. By simply re-using those files directly, Git saves lots of space: if we only touched one file, the new commit only takes space for one new copy. Even then it's compressed (sometimes very compressed, though this actually happens later), so that a .git directory can actually be smaller than the files it contains, once they're expanded out to normal everyday files. The de-duplication is safe because the committed files are frozen for all time. Nobody can go change one, so it's safe for commits to depend on each other's copies.
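One way to see the de-duplication for yourself is to compare the object IDs that two commits record for the same path. This is only a sketch in a throwaway repository; a.txt and b.txt are made-up names.

```shell
set -e
# Throwaway repository; file names and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'never changes\n' > a.txt
printf 'first version\n' > b.txt
git add a.txt b.txt
git commit -q -m "First"

# Touch only b.txt, then commit again.
printf 'second version, edited\n' > b.txt
git add b.txt
git commit -q -m "Second"

# <commit>:<path> names the stored (frozen) copy of that file in that commit.
old_a=$(git rev-parse HEAD~1:a.txt)
new_a=$(git rev-parse HEAD:a.txt)
old_b=$(git rev-parse HEAD~1:b.txt)
new_b=$(git rev-parse HEAD:b.txt)

echo "a.txt: $old_a -> $new_a"   # same ID: both commits share one stored copy
echo "b.txt: $old_b -> $new_b"   # different IDs: only this file needed a new copy
```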
Because the stored files are in this special, frozen-for-all-time, Git-only format, though, Git has to expand out each file into an ordinary everyday copy. This ordinary copy isn't Git's copy: it is your copy, to do with as you will. Git will just write to these when you tell it to do so, so that you have your copies to work with. These usable copies are in your working tree or work-tree.
What this means is that when you check out some particular commit, there are automatically two copies of each file:
- Git has a frozen-for-all-time, Git-ified copy in the current commit. You can't change this copy (though you can of course select a different commit, or make a new commit).
- You have, in your work-tree, a normal-format copy. You can do anything you want to this, using any of the commands on your computer.
Other version control systems (including Mercurial as mentioned above) stop here, with these two copies. You just modify your work-tree copy, then commit. Git ... doesn't.
The index
In between these two copies, Git stores a third copy[2] of every file. This third copy is in the frozen format, but unlike the frozen copy in the commit, you can change it. To change it, you use git add.
The git add command means make the index copy of the file match the work-tree copy. That is, you are telling Git: Replace the frozen-format, de-duplicated copy that's in the index now, by compressing my updated work-tree copy, de-duplicating it, and getting it ready to be frozen into a new commit. If you don't use git add, the index still holds the frozen-format copy from the current commit.
When you run git commit, Git packages up whatever is in the index right then to use as the new snapshot. Since it's already in the frozen format, and pre-de-duplicated, Git does not have to do a lot of extra work.
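The three copies can be watched with a small sketch like this one, which prints the object ID of the index entry before and after git add. It runs in a throwaway repository and file.txt is a made-up name.

```shell
set -e
# Throwaway repository; file name and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'version one\n' > file.txt
git add file.txt
git commit -q -m "Initial commit"
head_blob=$(git rev-parse HEAD:file.txt)     # frozen copy in the current commit

printf 'version two, edited\n' > file.txt    # change only the work-tree copy
before_add=$(git rev-parse :file.txt)        # ":path" names the index copy
worktree_blob=$(git hash-object file.txt)    # ID the work-tree content would get

git add file.txt                             # make the index copy match the work-tree
after_add=$(git rev-parse :file.txt)

git commit -q -m "Update file.txt"           # the snapshot comes from the index
committed=$(git rev-parse HEAD:file.txt)

echo "index before add: $before_add (same as old commit: $head_blob)"
echo "index after add:  $after_add (same as work-tree: $worktree_blob)"
echo "new commit:       $committed"
```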
This also explains what untracked files are all about. An untracked file is a file that is in your work-tree but isn't in Git's index right now. It doesn't matter how the file wound up in this state. Maybe you copied it from some other place on your computer, into your work-tree. Maybe you created it fresh here. Maybe there was a copy in Git's index, but you removed that copy with git rm --cached. One way or another, there is a copy here in your work-tree, but there isn't a copy in Git's index. If you make a new commit now, that file won't be in the new commit.
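As a sketch of the git rm --cached case: the file below starts out tracked, loses its index copy, and is then left out of the next commit even though it is still sitting in the work-tree. The repository is a throwaway and keep.txt is a made-up name.

```shell
set -e
# Throwaway repository; file name and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'data\n' > keep.txt
git add keep.txt
git commit -q -m "Add keep.txt"

git rm -q --cached keep.txt     # drop the index copy, keep the work-tree copy

tracked=$(git ls-files)                                 # files in the index
untracked=$(git ls-files --others --exclude-standard)   # work-tree only
echo "tracked:   [$tracked]"
echo "untracked: [$untracked]"

git commit -q -m "Stop tracking keep.txt"
in_commit=$(git ls-tree -r --name-only HEAD)            # files in the new snapshot
echo "in new commit: [$in_commit]"
```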
Note that git checkout initially fills in Git's index from the commit you check out. So the index starts out matching the commit. Git also fills in your work-tree from this same source. So, initially, all three match. When you change files in your work-tree and git add them, well, now the index and your work-tree match. Then you run git commit and Git makes a new commit from the index, and now all three match again.
Because Git makes new commits from the index, we can put things this way: Git's index holds the next commit you plan to make. This ignores the expanded role that Git's index takes on during a conflicted merge, but we'd like to ignore that for now anyway. :-)
That's all there is to it, but it's still pretty complicated! It's particularly tricky because there's no easy way to see exactly what is in Git's index.[3] But there is a Git command that tells you what's going on, in a way that's pretty useful, and that command is git status.
[2] Technically, this isn't actually a copy at all. Instead, it's a reference to the Git-ified file, pre-de-duplicated and everything. There's more stuff in here as well, such as the mode, file name, a staging number, and some cache data to make Git go fast. But unless you get into working with some of Git's low-level commands (git ls-files --stage and git update-index in particular), you can just think of it as a copy.
[3] The git ls-files --stage command will show you the names and staging numbers of every file in Git's index, but usually this isn't very useful anyway.
git status
The git status command actually works by running two separate git diff commands for you (and also doing some other useful stuff, such as telling you which branch you're on).
The first git diff compares the current commit (which, remember, is frozen for all time) to whatever is in Git's index. For files that are the same, Git will say nothing at all. For files that are different, Git will tell you that this file is staged for commit. This includes all-new files (if the commit doesn't have sub.py in it, but the index does, then this file is added) and any removed files that were (and are) in the commit but aren't in the index any more (git rm, perhaps).
The second git diff compares all the files in Git's index to the files in your work-tree. For files that are the same, Git says nothing at all. For files that are different, Git will tell you that this file is not staged for commit. Unlike the first diff, this particular list doesn't include files that are all-new: if a file named untracked exists in your work-tree, but not in Git's index, Git just adds it to the list of untracked files.[4]
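You can approximate those two comparisons by hand. The sketch below is not literally what git status runs internally; it just uses git diff --cached, git diff and git ls-files to produce the same three lists. The repository is a throwaway, and main.py is a made-up name alongside the sub.py and untracked names used above.

```shell
set -e
# Throwaway repository; file names and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'print("main")\n' > main.py
git add main.py
git commit -q -m "Initial commit"

printf 'print("sub")\n' > sub.py
git add sub.py                              # all-new file, staged
printf 'print("main, edited")\n' > main.py  # tracked file, edited but not staged
printf 'scratch\n' > untracked              # never added to the index

first=$(git diff --cached --name-status)                # current commit vs index
second=$(git diff --name-status)                        # index vs work-tree
others=$(git ls-files --others --exclude-standard)      # in work-tree, not in index

echo "staged for commit:     $first"
echo "not staged for commit: $second"
echo "untracked:             $others"
git status --short    # the same three facts in git status's own summary form
```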
At the end, having accumulated these untracked files in a list, git status will announce those files' names too, but there's a special exception: if a file's name is listed in a .gitignore file, that suppresses this last listing. Note that listing a tracked file (one that's in Git's index) in a .gitignore has no effect here: the file is in the index, so it gets compared, and gets committed, even if it's listed in .gitignore. The ignore file only suppresses the "untracked file" complaints.[5]
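A quick sketch of that point: below, one tracked file and one untracked file are both listed in .gitignore, and only the untracked one is actually hidden. The repository is a throwaway; settings.cfg and scratch.log are made-up names.

```shell
set -e
# Throwaway repository; file names and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'mode=one\n' > settings.cfg
git add settings.cfg
git commit -q -m "Track settings.cfg"

# List both a tracked file and an untracked file in .gitignore.
printf 'settings.cfg\nscratch.log\n' > .gitignore
printf 'mode=two, edited\n' > settings.cfg
printf 'noise\n' > scratch.log

modified=$(git diff --name-only)                         # tracked file still compared
untracked=$(git ls-files --others --exclude-standard)    # scratch.log is suppressed

echo "modified:  $modified"
echo "untracked: $untracked"
```

The .gitignore file itself shows up as untracked here because nothing has added it yet.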
[4] When using the short version of git status (git status -s), the untracked files aren't as separated-out, but the principle is the same. Accumulating the files like this also lets git status summarize a bunch of untracked files' names by just printing a directory name, sometimes. To get the full list, use git status -uall or git status -u.
[5] Listing a file also makes en-masse "add many files" operations like git add . or git add * skip over the untracked file. This part gets a little more complicated, since you can use git add --force to add a file that would normally be skipped. There are some other normally-minor special cases, all of which add up to this: the file .gitignore might be more properly called .git-do-not-complain-about-these-untracked-files-and-do-not-auto-add-them or something equally unwieldy. But that's too ridiculous, so .gitignore it is.
git add -u, git commit -a, etc.
There are several handy shortcuts to know about here:
- git add . will add all updated files in the current directory and any sub-directory. This respects .gitignore, so if a file that is currently untracked is not complained-about by git status, it won't be auto-added.
- git add -u will auto-add all updated files anywhere in your work-tree.[6] This affects only tracked files. Note that if you've removed the work-tree copy, this will remove the index copy too (git add does this as part of its make the index match the work-tree thing).
- git add -A is like running git add . from the top level of your work-tree (but see footnote 6).
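To compare these shortcuts, the sketch below stages with git add -u first and then with git add -A, checking after each step what is staged and what is still untracked. It assumes Git 2.0 or later and a throwaway repository; all file names are made up.

```shell
set -e
# Throwaway repository; file names and identity are hypothetical. Needs Git >= 2.0.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

mkdir sub
printf 'tracked, version one\n' > tracked.txt
printf 'this file will be deleted\n' > sub/gone.txt
git add .
git commit -q -m "Initial commit"

printf 'tracked, version two (edited)\n' > tracked.txt   # modify a tracked file
rm sub/gone.txt                                           # remove a tracked file
printf 'brand new\n' > new.txt                            # create an untracked file

git add -u    # tracked files only: stages the edit and the removal
staged_after_u=$(git diff --cached --name-only --no-renames)
untracked_after_u=$(git ls-files --others --exclude-standard)

git add -A    # also picks up the new, untracked file
staged_after_A=$(git diff --cached --name-only --no-renames)
untracked_after_A=$(git ls-files --others --exclude-standard)

echo "after add -u, staged:    $staged_after_u"
echo "after add -u, untracked: $untracked_after_u"
echo "after add -A, staged:    $staged_after_A"
```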
Besides these, you can run git commit -a, which is roughly equivalent[7] to running git add -u and then git commit. That is, this gets you the same behavior that is convenient in Mercurial.
I generally advise against the git commit -a pattern: I find that it's better to use git status often, look closely at the output, and if the status is not what you expected, figure out why that's the case. Using git commit -a, it's too easy to accidentally modify a file and commit a change you didn't intend to commit. But this is mostly a matter of taste / opinion.
[6] If your Git version predates Git 2.0, be careful here: git add -u only works on the current directory and sub-directories, so you must climb to the top level of your work-tree first. The git add -A option has a similar issue.
[7] I say roughly equivalent because git commit -a actually works by making an extra index, and using that other index to do the commit. If the commit works, you get the same effect as doing git add -u && git commit. If the commit doesn't work (if you make Git skip the commit in any of the many ways you can do that), then no files are git add-ed afterward, because Git throws out the temporary extra index and goes back to using the main index.
There are additional complications that come in if you use git commit --only here. In this case, Git creates a third index, and things get very tricky, especially if you use pre-commit hooks. This is another reason to use separate git add operations.
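The "roughly equivalent" part of footnote 7 can be checked with a sketch like the following. It aborts a commit by passing an empty message, which is just one convenient way to make Git skip the commit, and then looks at what is left staged. The repository is a throwaway and file.txt is a made-up name.

```shell
set -e
# Throwaway repository; file name and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

printf 'version one\n' > file.txt
git add file.txt
git commit -q -m "Initial commit"
printf 'version two, edited\n' > file.txt

# An empty message makes Git abort the commit.
if git commit -q -a -m "" 2>/dev/null; then aborted_a=no; else aborted_a=yes; fi
staged_after_a=$(git diff --cached --name-only)      # temporary index was discarded

# The two-step version leaves the file staged even though the commit aborts.
git add -u
if git commit -q -m "" 2>/dev/null; then aborted_u=no; else aborted_u=yes; fi
staged_after_add=$(git diff --cached --name-only)

echo "commit -a aborted: $aborted_a, staged afterwards: [$staged_after_a]"
echo "add -u then commit aborted: $aborted_u, staged afterwards: [$staged_after_add]"
```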