Transferring legacy code base from cvs to distributed repository (e.g. git or mercurial). Suggestions needed for initial repository design [closed]

Introduction and Background

We are in the process of changing source control system and we are currently evaluating git and mercurial. The total code base is around 6 million lines of code, so not massive and not really small either.

Let me first start off with a very brief introduction to how the current repository design looks.

We have one base folder for the complete code base, and beneath that level there are all sorts of modules used in several different contexts. For example “dllproject1” and “dllproject2” can be looked at as completely separate projects.

The software we are developing is something we call a configurator, which can be customized endlessly for different customer needs. In total we probably have 50 different versions of it. However, they have one thing in common: they all share a couple of mandatory modules (mandatory_module1 ..). These folders basically contain kernel/core code, common language resources, etc. Each customization can then be any combination of the other modules (module1 ..).

Since we currently are using cvs we've added aliases in the CVSROOT/modules file. They might look something like:

core -a mandatory_module1 mandatory_module2 mandatory_module3
project_x -a module1 module3 module5 core

So if someone decides to work on project_x, he/she can quickly checkout the modules needed by:

base>cvs co project_x
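
For readers less familiar with CVS alias modules, here is a minimal Python sketch of how the two alias lines above expand into a flat list of modules. The `parse_aliases` and `expand` helpers are hypothetical names made up for this illustration, not part of CVS, and the sketch only handles simple `-a` lines like the ones shown.

```python
# Illustrative only: models how nested "-a" aliases resolve to module names.
MODULES_FILE = """\
core -a mandatory_module1 mandatory_module2 mandatory_module3
project_x -a module1 module3 module5 core
"""


def parse_aliases(text):
    """Parse 'name -a member member ...' lines into a dict of lists."""
    aliases = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "-a":
            aliases[parts[0]] = parts[2:]
    return aliases


def expand(name, aliases, _path=()):
    """Recursively expand an alias into real module names, in order."""
    if name in _path:
        raise ValueError(f"alias cycle detected at {name!r}")
    if name not in aliases:
        return [name]  # not an alias, so it is a real module
    result = []
    for member in aliases[name]:
        for module in expand(member, aliases, _path + (name,)):
            if module not in result:  # keep order, drop duplicates
                result.append(module)
    return result


ALIASES = parse_aliases(MODULES_FILE)
print(expand("project_x", ALIASES))
# → ['module1', 'module3', 'module5', 'mandatory_module1',
#    'mandatory_module2', 'mandatory_module3']
```

Whatever replaces CVS would need to offer something equivalent to this expansion, ideally behind a single command.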

Questions

Intuitively it just feels wrong to have the base folder as a single repository. As a programmer you should be able to check out the exact subset of the code needed for the project you are currently working on. What are your thoughts on this?

On the other hand, it feels more right to have each of these modules in a separate repository. But this makes it harder for programmers to check out the modules that they need. You should be able to do this with a single command. So my question is: are there similar ways of defining aliases in git/mercurial?

Any other questions, suggestions, pointers are highly welcome!

PS. I have searched for similar questions but didn’t feel that any of them applied 100% to my situation.


Solution 1:

Just a quick comment to remind you that:

  • those migrations often offer the opportunity to reorganize the sources, not along modules (each with its own repository) but rather along a functional domain split (several modules for the same functional domain being put in the same repository).

Then submodules are to be used, as a way to define a configuration.

  • Git is alright, but by Linus's own admission, putting everything into one repository can be problematic.

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.


Those two points argue for a more component-oriented approach for large systems (and large legacy repositories).

With Git submodules, you can check them out in your project (even if it is a two-step process). There are, however, tools that can make submodule management easier (git.rake for instance).
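
To make that concrete, here is a minimal Python sketch that prints the Git commands for wiring modules into a project as submodules, and then for checking the project out in two steps. The base URL, the one-repository-per-module layout, and the helper names are all assumptions for illustration; only `git submodule add`, `git clone`, `git -C` and `git submodule update --init` are real Git commands.

```python
import shlex

# Hypothetical server location; assumes one repository per module.
BASE_URL = "https://git.example.invalid/base"


def submodule_add_commands(modules, base_url=BASE_URL):
    """Commands to run once, inside the project repository, to register modules."""
    commands = []
    for module in modules:
        url = f"{base_url.rstrip('/')}/{module}.git"
        commands.append(
            f"git submodule add {shlex.quote(url)} {shlex.quote(module)}"
        )
    return commands


def checkout_commands(project, base_url=BASE_URL):
    """The two-step checkout: clone the project, then fetch its submodules."""
    url = f"{base_url.rstrip('/')}/{project}.git"
    return [
        f"git clone {shlex.quote(url)} {shlex.quote(project)}",
        f"git -C {shlex.quote(project)} submodule update --init",
    ]


if __name__ == "__main__":
    for line in submodule_add_commands(["module1", "module3", "module5"]):
        print(line)
    for line in checkout_commands("project_x"):
        print(line)
```

The project repository then plays roughly the role the CVS alias played: it records which modules belong to the configuration.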


When I'm thinking of fixing a bug in a module that's shared between several projects, I just fix the bug and commit it and all just do their updates

That is what I describe in the post Vendor Branch as the "system approach": everyone works on the latest (HEAD) of everything, and it is effective for a small number of projects.
For a large number of modules, though, the notion of "module" is still very useful, but its management is not the same with a DVCS:

  • for closely related modules (i.e. "in the same functional domain", like all modules related to "PNL" (Profit aNd Losses) or "Risk analysis" in a financial domain), you do need to work with the latest (HEAD) of all components involved.
    That would be achieved with a subtree strategy, not in order for you to publish (push) corrections on those other submodules, but to track work done by other teams.
    Git allows that with the added bonus that this "tracking" does not have to take place between your repository and one "central" repository, but can also take place between you and the local repository of the other team, allowing for very quick back-and-forth integration and testing between projects of a similar nature.

  • however, for modules which are not directly in your functional domain, submodules are a better option, because they refer to a fixed version of a module (a commit):
    when a low-level framework changes, you do not want the change to propagate instantly, since it would affect all the other teams, who would then have to drop what they were doing to adapt their code to the new version (you do, though, want all the other teams to be aware of the new version, so that they do not forget to update that low-level component or "module").
    That allows you to work only with official, stable, identified versions of other modules, and not with potentially unstable or not fully tested HEADs.
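
The difference between the two bullets can be shown with a toy model. This is only a sketch in Python, not how Git stores anything: `Module`, `Project`, `track` and `pin` are hypothetical names, and commits are reduced to plain strings. Tracked modules always resolve to their HEAD, while pinned modules stay on the recorded commit until someone moves the pin.

```python
class Module:
    """A shared module, reduced to an ordered list of commit ids."""

    def __init__(self, name, first_commit):
        self.name = name
        self.commits = [first_commit]

    def commit(self, commit_id):
        self.commits.append(commit_id)

    @property
    def head(self):
        return self.commits[-1]


class Project:
    """A configuration mixing HEAD-tracked and commit-pinned modules."""

    def __init__(self):
        self.tracked = {}  # name -> Module, always resolved at HEAD
        self.pinned = {}   # name -> (Module, pinned commit id)

    def track(self, module):
        self.tracked[module.name] = module

    def pin(self, module, commit_id):
        if commit_id not in module.commits:
            raise ValueError(f"unknown commit {commit_id!r} in {module.name}")
        self.pinned[module.name] = (module, commit_id)

    def resolve(self):
        """Return the commit each module is used at right now."""
        versions = {name: m.head for name, m in self.tracked.items()}
        versions.update({name: cid for name, (_, cid) in self.pinned.items()})
        return versions

    def outdated(self):
        """Pinned modules for which a newer commit exists upstream."""
        return [n for n, (m, cid) in self.pinned.items() if m.head != cid]


risk = Module("risk", "r1")            # same functional domain
framework = Module("framework", "f1")  # low-level, owned by another team

project = Project()
project.track(risk)
project.pin(framework, "f1")

risk.commit("r2")
framework.commit("f2")

print(project.resolve())   # → {'risk': 'r2', 'framework': 'f1'}
print(project.outdated())  # → ['framework']
```

The `outdated` list is the "be aware of the new version" part: the team sees that `framework` has moved on, but adopts it only when it chooses to re-pin.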

Solution 2:

As for the Mercurial side, the recommendation is also to refactor large legacy CVS/SVN repositories into smaller components. Common code should be put into its own libraries, and the application code will then depend on those libraries in a similar way to how it depends on other libraries.

Mercurial has the forest extension, which allows you to manage a "forest" of "source trees". With that approach you combine several smaller repositories into a larger one. With CVS you do the opposite: you check out a smaller portion of a large repository.

I have not personally used the forest extension and its page says that one should use an updated version compared to the one bundled with Mercurial. However, it is used by a big organization like Sun in its OpenJDK project.

There is also work currently underway to add sub-repository support directly to the core of Mercurial, as per the design on the nested repositories page in the Mercurial wiki.