Git rebase loses history, so why rebase?

Solution 1:

Imagine you are working on a Secret Project of World Domination. There are three masterminds in this conspiracy:

  • The Genius
  • The General
  • The Computer Hacker

They all agree to meet at their secret base in one week, each bringing one detailed plan.

The Computer Hacker, being a pragmatic programmer, suggested that they use Git to store all the plan files. Each one would fork the initial project repo, and they would merge everything in one week.

They all agree and in the following days the story goes like this:

The Genius

He made a total of 70 commits, 10 each day.

The General

He spied on his comrades' repos and devised a strategy to beat them. He made 3 commits, all on the last day.

The Computer Hacker

This pragmatic programmer used branches. He made 4 different plans, each one on its own branch. Each branch was rebased down to just one commit.

Seven days passed and the group met again to merge all the plans into one masterpiece. All of them were eager to start, so each one tried to merge everything on his own.

Here goes the story:

The Genius

He merged all the changes from the General's repo and then from the Computer Hacker's. Then, being a lover of logic, he took a look at the log. He expected to see the logical evolution of an idea, with each commit building on the ones before it.

But what the log showed was a myriad of commits from different ideas, all interleaved on the timeline. A reader could not follow the evolution or the reasoning behind the commits just by reading that timeline.

So he ended up with a mess that even a genius couldn't understand.

The General

The General thought: Divide and conquer!

So he merged the Genius's repo into his own. He looked at the log and saw a series of commits from the Genius's idea, which followed an understandable progression until the last day. On the last day, the ideas of the General and the Genius were mixed together.

He had been spying on the Computer Hacker and knew about the rebase solution. So he rebased his own idea and tried the merge again.

Now the log showed a logical progression every day.

The Computer Hacker

This pragmatic programmer created an integration branch for the Genius's idea, another for the General's idea, and another for his own ideas. He rebased each branch, and then he merged them all into master.

All of his teammates saw that his log was great. It was simple. It was understandable at first sight.

If an idea introduced a problem, it was clear which commit had introduced it, for there was just one.

They ended up conquering the whole world and banished the use of Subversion.

And all were happy.
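
For the curious, here is a rough sketch of how a log like the Computer Hacker's could be produced in a throwaway repository. The branch names, file names and commit messages are invented for the demo, and `git merge --squash` stands in for the interactive rebase of the story: it is a different command that also ends with exactly one commit per idea.

```shell
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

commit() { echo "$1" >> "$2" && git add "$2" && git commit -q -m "$1"; }

commit "Initial plan" plan.txt

# Each mastermind works on his own branch, several commits each.
for idea in genius general hacker; do
  git checkout -q -b "$idea" main
  commit "$idea: draft" "$idea.txt"
  commit "$idea: revision" "$idea.txt"
  commit "$idea: final" "$idea.txt"
  git checkout -q main
done

# Integrate every idea as a single commit (squash merge, not rebase).
for idea in genius general hacker; do
  git merge -q --squash "$idea" >/dev/null
  git commit -q -m "Plan from the $idea"
done

git log --oneline   # one line per idea, plus the initial commit
```

If one of the three plans breaks something, there is exactly one commit to look at or revert.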

Solution 2:

"As far as I can tell, rebasing removes all that history."

That's not correct. Rebasing, as the name suggests, changes the base of commits. Usually no commit is lost in the process (except that you don't get a merge commit); the commits are re-created on top of the new base, so they get new hashes but keep their changes. Your argument for keeping the whole development process in the history exactly as it happened is valid, but very often that leads to confusing histories.

This is especially true when several people each work on their own branch while depending on changes from the others (for example, A asks B to implement something so that A can use that feature in his own development). That leads to many merges, for example:
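
You can check this yourself in a scratch repository. The following is only a sketch with made-up branch and file names: it counts the commits on a topic branch before and after a rebase, and shows that the pre-rebase tip still exists as an object afterwards.

```shell
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

commit() { echo "$1" >> "$2" && git add "$2" && git commit -q -m "$1"; }

commit "A1" a.txt
git checkout -q -b topic
commit "B1" b.txt
commit "B2" b.txt
git checkout -q main
commit "A2" a.txt

before=$(git rev-list --count main..topic)   # commits only on topic
old_tip=$(git rev-parse topic)

git checkout -q topic
git rebase -q main                           # move topic's base to the tip of main

after=$(git rev-list --count main..topic)
new_base=$(git merge-base main topic)

echo "commits on topic before: $before, after: $after"
git log --oneline --graph topic
git cat-file -t "$old_tip"                   # prints: commit
```

The two counts are the same; only the base (and therefore the hashes) changed. The old commits stay reachable through the reflog until Git eventually garbage-collects them.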

     #--#--#--#--*-----*-----------------*---#---\         Branch B
    /           /     /                 /         \
---#-----#-----#-----#-----#-----#-----#-----#-----*       Branch A

In this example we have a branch that is developed separately for a time but constantly pulls in changes from the original branch (# are original commits, * are merges).

Now if we rebase Branch B before merging it back in, we could get the following:

                             #--#--#--#--#---\         Branch B
                            /                 \
---#---#---#---#---#---#---#---#---------------*       Branch A

This represents the same actual changes, but B was rebased onto a more recent commit on A than the one it originally branched from, so all the merges that were previously done on B are no longer needed (because those changes are already contained in that commit). The only commits now missing are the merges, which usually carry no information about the development process. (Note that in this example you could also rebase that last commit on A later on to get a straight line, effectively removing any hint of the second branch.)
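
A small sketch of that scenario, assuming a scratch repository with invented branch and file names: B merges from main twice, is then rebased, and is finally merged back with `--no-ff` so the branch stays visible in the graph.

```shell
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

commit() { echo "$1" >> "$2" && git add "$2" && git commit -q -m "$1"; }

commit "A1" a.txt
git checkout -q -b B
commit "B1" b.txt
git checkout -q main && commit "A2" a.txt
git checkout -q B && git merge -q --no-edit main   # first '*' on B
commit "B2" b.txt
git checkout -q main && commit "A3" a.txt
git checkout -q B && git merge -q --no-edit main   # second '*' on B
git checkout -q main && commit "A4" a.txt
git checkout -q B

merges_before=$(git rev-list --count --merges main..B)
git rebase -q main                                 # replay B1 and B2 on top of A4
merges_after=$(git rev-list --count --merges main..B)
own_commits=$(git rev-list --count main..B)

git checkout -q main
git merge -q --no-ff --no-edit B                   # the single remaining '*'

echo "merges on B before: $merges_before, after: $merges_after"
git log --oneline --graph
```

By default `git rebase` skips merge commits when it replays a branch, which is why the two intermediate merges disappear while B1 and B2 survive.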

Solution 3:

You rebase mainly to replay your local commits (the ones you haven't pushed yet) on top of a remote branch (one you have just fetched), in order to resolve any conflicts locally (i.e. before you push them to the upstream repo).
See "git workflow and rebase vs merge questions" and, in more detail, "git rebase vs git merge".
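
As an illustration, here is a minimal sketch of that fetch-then-rebase cycle, simulated with a local bare repository playing the role of upstream. The two developers, "alice" and "bob", and all file names are made up for the demo.

```shell
# Scratch area; "upstream.git", "alice" and "bob" are hypothetical.
work=$(mktemp -d) && cd "$work"
git init -q --bare upstream.git
git --git-dir=upstream.git symbolic-ref HEAD refs/heads/main

# Alice seeds the shared repository.
git init -q alice && cd alice
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice" && git config user.email "alice@example.com"
echo "base" > shared.txt && git add shared.txt && git commit -q -m "Base"
git remote add origin ../upstream.git
git push -q -u origin main
cd ..

# Bob clones, then commits locally without pushing.
git clone -q upstream.git bob && cd bob
git config user.name "Bob" && git config user.email "bob@example.com"
echo "bob" > bob.txt && git add bob.txt && git commit -q -m "Bob: local work"
cd ..

# Meanwhile Alice publishes another commit.
cd alice
echo "alice" > alice.txt && git add alice.txt && git commit -q -m "Alice: published work"
git push -q origin main
cd ..

# Bob fetches and replays his unpushed commit on top of the fetched branch.
cd bob
git fetch -q origin
git rebase -q origin/main
merges=$(git rev-list --count --merges HEAD)
git log --oneline --graph
git push -q origin main     # a plain fast-forward push
```

Had the two commits touched the same lines, the conflict would have shown up during `git rebase`, on Bob's machine, before anything reached upstream.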

But rebase isn't limited to that scenario, and combined with "--interactive", it allows for some local re-ordering and cleaning of your history. See also "Trimming GIT Checkins/Squashing GIT History".
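
A small sketch of such a local cleanup, using invented file names and commit messages. Normally `git rebase -i` opens the todo list in your editor; here `GIT_SEQUENCE_EDITOR=true` accepts the list unchanged so the example runs unattended, and `--autosquash` does the re-ordering by moving the `fixup!` commit next to the commit it amends.

```shell
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

commit() { echo "$1" >> "$2" && git add "$2" && git commit -q -m "$1"; }

commit "Initial commit" README
commit "Add parser" parser.txt
target=$(git rev-parse HEAD)
commit "Add printer" printer.txt

# A late fix that really belongs to "Add parser".
echo "fix typo" >> parser.txt
git add parser.txt
git commit -q --fixup="$target"      # subject becomes "fixup! Add parser"

git log --oneline                    # four commits, the fix at the top

# Accept the autosquashed todo list as-is instead of opening an editor.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash HEAD~3

git log --oneline                    # three commits, fix folded into "Add parser"
```

Run the same `git rebase -i` command without the `GIT_SEQUENCE_EDITOR` override to edit the list by hand (reorder lines, change `pick` to `squash`, `reword`, and so on).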

"why wouldn't you want the repo history to reflect all the ways the code developed, including where and how it diverged"

  • In a centralized VCS, it is important to never lose the history, and it should indeed reflect "all the ways the code developed".
  • In a distributed VCS, where you can do all kinds of local experiments before publishing some of your branches upstream, it makes less sense to keep everything in the history: not everyone needs to clone and see all of your branches, tests, alternatives, and so on.

Solution 4:

Organizing your history is the point of using rebase over merge, and it's extremely valuable.

What use is a git history which accurately reflects every code change of the past? Do you need such a thing for some kind of certification effort? If you don't, why do you want that? The past as it really happened is messy and difficult to understand. I mean, why not also include every character which you wrote then deleted while editing the file?

The most common way you'll use your git history is reading it. Finding which commit caused an issue and exploring the different versions of a file are probably the two most common use cases. Both become much simpler and more convenient when your git history is straight (and clean!).

Perhaps even more important than using rebase to share changes with the rest of the team, each member should use rebase to shape their changes into a logical collection of self-contained commits. Development doesn't naturally occur in logical steps that directly follow each other. Sometimes you push a commit to your branch only because it's the end of the day and you have to go. Putting this kind of information in your git history is pure noise. I routinely squash a feature that took 20 commits down to just one or two, because there's no point in showing anything that didn't end up being part of the finished product.

Even if the development history of your feature was an unholy mess, you can and absolutely should craft a utopian git history: one in which you got everything right the first time and in the correct order, did feature A on day 1 and feature B on day 2, and had no bugs or temporary print statements. Why should you do that? Because it's easier to understand for someone reading your changes.
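
One possible way to do this, sketched with invented file names and messages: instead of an interactive rebase, this uses a plain `git reset` back to the branch point, which keeps the finished files in the working tree, and then commits them again in logical pieces.

```shell
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "readme" > README && git add README && git commit -q -m "Initial commit"
git checkout -q -b feature

# The messy history as it really happened.
echo "a: half done" > feature_a.txt && git add -A && git commit -q -m "wip"
echo "b: draft" > feature_b.txt && echo "DEBUG" > debug.log \
  && git add -A && git commit -q -m "end of day"
echo "a: done" > feature_a.txt && git add -A && git commit -q -m "fix a"
echo "b: done" > feature_b.txt && git rm -q debug.log \
  && git add -A && git commit -q -m "remove debug, finish b"

final_tree=$(git rev-parse "HEAD^{tree}")

# Forget the four commits but keep the finished files on disk.
git reset -q main

# Re-commit the result as two self-contained steps.
git add feature_a.txt && git commit -q -m "Add feature A"
git add feature_b.txt && git commit -q -m "Add feature B"

git log --oneline main..feature      # two clean commits
```

The final content of the branch is identical to what the messy history produced; only the story told by the log is different. Do this only with commits you have not shared yet.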

If you combine this idea with git bisect, then curating your master history to contain only commits that pass all the tests defined at the time becomes even more helpful. It becomes trivial to find the origin of a bug, as git bisect will just work. If you use merge and upload the entire development history of each of your branches to master, bisect is far less likely to be helpful, because it keeps landing on half-finished intermediate commits.
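
To show what "just work" looks like, here is a minimal sketch on a linear ten-commit history. The file name, the `BUG` marker and the `grep` check are stand-ins for a real code base and its test suite.

```shell
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo" && git config user.email "demo@example.com"

# Ten commits; the seventh one sneaks in a bug.
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "change $i" >> app.txt
  if [ "$i" -eq 7 ]; then echo "BUG" >> app.txt; fi
  git add app.txt && git commit -q -m "Change $i"
done

good=$(git rev-list --max-parents=0 HEAD)    # root commit, known to be fine

# Mark HEAD as bad and the root as good, then let git drive the search.
git bisect start HEAD "$good" >/dev/null
# The command must exit 0 on a good commit and non-zero on a bad one.
git bisect run sh -c '! grep -q BUG app.txt' >/dev/null

culprit=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first bad commit: $culprit"            # prints: first bad commit: Change 7
```

In a real project the `sh -c` check would be replaced by your test command. This only works reliably if every commit it may land on can actually be built and tested.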