How does Git solve the merging problem? [closed]
SVN made branching much easier by making branches really cheap, but merges remain a real problem in SVN - one that Git supposedly solves.
Does Git achieve this, and how?
(disclaimer: All I know about Git is based on the Linus lecture - total git noob here)
Git will not prevent conflicts in merges, but it can reconcile histories even when they do not share any common ancestor.
(It does so through the grafts file, .git/info/grafts, which is a list, one entry per line, of a commit followed by its parents, that you can modify for that "reconciliation" purpose.)
So pretty powerful right there.
But to really get a glimpse of "how merges have been thought through", you can start by turning to Linus himself, and realize this issue is not so much about "algorithm":
Linus: Me personally, I want to have something that is very repeatable and non-clever. Something I understand or tells me that it can't do it.
And quite frankly, merging single-file history without taking all the other files' history into account makes me go "ugh".
The important part of a merge is not how it handles conflicts (which need to be verified by a human anyway if they are at all interesting), but that it should meld the history together right so that you have a new solid base for future merges.
In other words, the important part is the trivial part: the naming of the parents, and keeping track of their relationship. Not the clashes.
And it looks like 99% of SCM people seem to think that the solution to that is to be more clever about content merges. Which misses the point entirely.
So Wincent Colaiuta adds (emphasis mine):
There is no need for fancy metadata, rename tracking and so forth.
The only thing you need to store is the state of the tree before and after each change.
What files were renamed? Which ones were copied? Which ones were deleted? What lines were added? Which ones were removed? Which lines had changes made inside them? Which slabs of text were copied from one file to another?
You shouldn't have to care about any of these questions and you certainly shouldn't have to keep special tracking data in order to help you answer them: all the changes to the tree (additions, deletes, renames, edits etc) are implicitly encoded in the delta between the two states of the tree; you just track what is the content.
Absolutely everything can (and should) be inferred.
Git breaks the mould because it thinks about content, not files.
It doesn't track renames, it tracks content. And it does so at a whole-tree level.
This is a radical departure from most version control systems.
It doesn't bother trying to store per-file histories; it instead stores the history at the tree level.
When you perform a diff you are comparing two trees, not two files.
The other fundamentally smart design decision is how Git does merges.
The merging algorithms are smart but they don't try to be too smart. Unambiguous decisions are made automatically, but when there's doubt it's up to the user to decide.
This is the way it should be. You don't want a machine making those decisions for you. You never will want it.
That's the fundamental insight in the Git approach to merging: while every other version control system is trying to get smarter, Git is happily self-described as the "stupid content manager", and it's better for it.
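The claim that renames, additions and deletions can be inferred from two tree states alone can be illustrated with a small sketch. This is not Git's code: the `snapshot` and `infer_changes` helpers are hypothetical names, a tree is modelled as a plain dict of path to content hash, and only exact-content renames are matched.

```python
import hashlib

def snapshot(files):
    """Map each path to a hash of its content (a stand-in for a tree state)."""
    return {path: hashlib.sha1(data.encode()).hexdigest()
            for path, data in files.items()}

def infer_changes(before, after):
    """Infer adds, deletes, edits and exact renames from two snapshots alone."""
    removed = {p: h for p, h in before.items() if p not in after}
    added = {p: h for p, h in after.items() if p not in before}
    edited = sorted(p for p in before if p in after and before[p] != after[p])

    renames = []
    for old_path, digest in sorted(removed.items()):
        # A path that vanished while identical content appeared elsewhere
        # is reported as a rename; nothing was recorded at commit time.
        match = next((p for p, h in sorted(added.items()) if h == digest), None)
        if match is not None:
            renames.append((old_path, match))
            del added[match]

    renamed_from = {old for old, _ in renames}
    deleted = sorted(p for p in removed if p not in renamed_from)
    return {"renamed": renames, "added": sorted(added),
            "deleted": deleted, "edited": edited}

before = snapshot({"README": "hello\n", "src/a.c": "int a;\n", "old.txt": "bye\n"})
after = snapshot({"README": "hello world\n", "lib/a.c": "int a;\n", "new.txt": "new\n"})
changes = infer_changes(before, after)
```

Here `src/a.c` and `lib/a.c` hash to the same value, so the move shows up as a rename even though no rename was ever stored.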
It is now generally agreed that the 3-way merge algorithm (perhaps with enhancements such as rename detection and handling of more complicated history), which takes into account the version on the current branch ('ours'), the version on the merged branch ('theirs'), and the version at the common ancestor of the merged branches ('ancestor'), is, from a practical point of view, the best way to resolve merges. In most cases, and for most contents, a tree-level merge (deciding which version of a file to take) is enough; there is rarely a need to deal with content conflicts, and when there is, the diff3 algorithm is good enough.
To use a 3-way merge you need to know the common ancestor of the merged branches (the so-called merge base). For this you need to know the full history between those branches. What Subversion lacked before the (current) version 1.5, without third-party tools such as SVK or svnmerge, was merge tracking, i.e. remembering for each merge commit which parents (which commits) were used in the merge. Without this information it is not possible to correctly calculate the common ancestor in the presence of repeated merges.
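A tree-level 3-way merge as described here can be sketched in a few lines. This is only an illustration under simple assumptions: each tree is a dict of path to content, a missing path means the file does not exist, and `merge_trees` is a hypothetical name rather than anything Git exposes.

```python
def merge_trees(base, ours, theirs):
    """Tree-level three-way merge over {path: content} dicts.

    Returns (merged, conflicts); conflicting paths are left for a
    content-level merge (e.g. diff3) or for the user.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including "both deleted")
            result = o
        elif o == b:      # only theirs changed: take theirs
            result = t
        elif t == b:      # only ours changed: keep ours
            result = o
        else:             # both changed in different ways
            conflicts.append(path)
            continue
        if result is not None:   # None means the file ends up deleted
            merged[path] = result
    return merged, conflicts

base = {"a": "1", "b": "1", "c": "1", "e": "1"}
ours = {"a": "2", "b": "1", "c": "3", "e": "1"}
theirs = {"a": "1", "b": "1", "c": "4", "d": "new"}
merged, conflicts = merge_trees(base, ours, theirs)
```

Only `c`, changed differently on both sides, is left as a conflict; every other path is decided without looking inside the file.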
Consider the following diagram:
---.---a---.---b---d---.---1
        \        /
         \-.---c/------.---2
(which would probably get mangled... it would be nice to have ability to draw ASCII-art diagrams here).
When we were merging commits 'b' and 'c' (creating commit 'd'), the common ancestor was the branching point, commit 'a'. But when we want to merge commits '1' and '2', now the common ancestor is commit 'c'. Without storing merge information we would have to conclude wrongly that it is commit 'a'.
Subversion (prior to version 1.5), and CVS before it, made merging hard because you had to calculate the common ancestor yourself and supply it manually when doing a merge.
Git stores information about all parents of a commit (more than one parent in the case of a merge commit) in the commit object. This way you can say that Git stores a DAG (directed acyclic graph) of revisions, storing and remembering the relationships between commits.
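The effect of recording both parents of a merge can be checked with a small sketch of merge-base computation on the diagram's named commits. It assumes a simplified history (the unnamed '.' commits are left out), represents the DAG as a dict of commit to parent list, and uses hypothetical helper names; Git's real implementation is more efficient than this.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def merge_bases(parents, x, y):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(parents, x) & ancestors(parents, y)
    return {c for c in common
            if not any(c != other and c in ancestors(parents, other)
                       for other in common)}

# History with merge tracking: 'd' remembers both of its parents.
tracked = {"a": [], "b": ["a"], "c": ["a"],
           "d": ["b", "c"], "1": ["d"], "2": ["c"]}

# Same history without merge tracking: 'd' only knows its first parent.
untracked = dict(tracked, d=["b"])

base_first_merge = merge_bases(tracked, "b", "c")
base_with_tracking = merge_bases(tracked, "1", "2")
base_without_tracking = merge_bases(untracked, "1", "2")
```

With both parents of 'd' recorded the second merge uses 'c' as its base; once that link is dropped the computation falls back to 'a', which is the wrong answer described above.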
(I am not sure how Subversion deals with the issues mentioned below)
Additionally, merging in Git can deal with two further complications: file renames (when one side renamed a file and the other did not, we want to keep the rename and have the changes applied to the correct file) and criss-cross merges (more complicated history, where there is more than one common ancestor).
- File renames during a merge are handled by heuristic rename detection based on a similarity score (both the similarity of file contents and the similarity of pathnames are taken into account). Git detects which files correspond to each other in the merged branches (and ancestor(s)). In practice it works quite well for real-world cases.
- Criss-cross merges (see the definition at the revctrl.org wiki), and the presence of multiple merge bases, are handled by the recursive merge strategy, which generates a single virtual common ancestor.
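Similarity-based rename detection can be sketched roughly as follows. This is an approximation for illustration only: it scores content with Python's `difflib.SequenceMatcher`, which is not how Git computes its score, it ignores pathname similarity, and the 0.5 threshold and the `detect_renames` name are assumptions of the sketch.

```python
from difflib import SequenceMatcher

def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted paths with added paths whose contents are similar enough.

    `deleted` and `added` map path -> file content.
    """
    candidates = []
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold:
                candidates.append((score, old_path, new_path))

    renames, used_old, used_new = {}, set(), set()
    # Best-scoring pairs win; each path takes part in at most one rename.
    for score, old_path, new_path in sorted(candidates, reverse=True):
        if old_path not in used_old and new_path not in used_new:
            renames[old_path] = new_path
            used_old.add(old_path)
            used_new.add(new_path)
    return renames

deleted = {"util.py": "def add(a, b):\n    return a + b\n"}
added = {
    "helpers/math.py": "def add(a, b):\n    # sum two numbers\n    return a + b\n",
    "LICENSE": "MIT License\n",
}
renames = detect_renames(deleted, added)
```

The moved file is still recognised after being edited on the way, which is what lets changes from the other branch be applied to the file under its new name.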
Answers above are all correct, but I think they miss the centerpoint of git's easy merges for me. An SVN merge requires you to keep track and remember what's been merged and that's a huge PITA. From their docs:
svn merge -r 23:30 file:///tmp/repos/trunk/vendors
Now that's not killer, but if you forget whether it's 23-30 inclusive or 23-30 exclusive, or whether you've already merged some of those commits, you're hosed and you've got to go figure out the answers to avoid repeating or missing commits. God help you if you branch a branch.
With git it's just git merge and all this happens seamlessly, even if you've cherry-picked a couple commits or done any number of fantastical git-land things.
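Why no revision range has to be remembered can be shown with a small sketch: what still needs merging is simply whatever is reachable from the other branch but not from yours, which is the set `git log ours..theirs` lists. The history and commit names below are made up, `not_yet_merged` is a hypothetical helper, and cherry-picks are not modelled.

```python
def reachable(parents, tip):
    """All commits reachable from `tip`, including itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def not_yet_merged(parents, ours, theirs):
    """Commits on `theirs` that `ours` does not contain yet."""
    return reachable(parents, theirs) - reachable(parents, ours)

# Trunk commits t1, t2; feature commits f1, f2; merge 'm'; later feature work f3.
history = {
    "t1": [], "t2": ["t1"],
    "f1": ["t1"], "f2": ["f1"],
    "m": ["t2", "f2"],
    "f3": ["f2"],
}

first_merge = not_yet_merged(history, "t2", "f2")
second_merge = not_yet_merged(history, "m", "f3")
```

After the first merge the graph itself records that f1 and f2 are already in, so the second merge only has f3 left to consider.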
As far as I know, the merging algorithms are not any smarter than those in other version control systems. However, because of git's distributed nature, there is no need for centralized merging efforts. Every developer can rebase or merge small changes from other developers into his tree at any time, thus the conflicts that arise tend to be smaller.