Migrate multiple svn repositories into a single git repository

We want to migrate from svn to git permanently so that we can take advantage of git's superior branching and collaboration features.

Our current svn repository looks like this:

svnrepo/
   frontend/
      trunk
      branches/
         ng/
         ...
      tags/
         1.x
         ...
   backend/
      trunk
      branches/
         ng/
         ...
      tags/
         1.x
         ...

The working layout is that we check out the frontend project and, inside it, create a backend folder into which we check out the backend project.
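
Roughly like this (the server URL is made up for illustration; for day-to-day work we do the same with the branches/ng paths instead of trunk):

svn checkout http://svn.example.com/svnrepo/frontend/trunk frontend
svn checkout http://svn.example.com/svnrepo/backend/trunk frontend/backend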

We now want to migrate to git and give up the split between frontend and backend as separate projects, because it causes us more problems than it solves. We want them both in a single git repository.

I wanted to use svn2git for the conversion. Unfortunately the latest development all happened in a branch, not in trunk, but I think this should not be a problem for svn2git. So the new git repository layout should look like this:

/            => svnrepo/frontend/branches/ng
/backend     => svnrepo/backend/branches/ng

Where => means "migrated/converted from".
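
For a single project I had something like this in mind (the URL is made up; --trunk, --branches and --tags are svn2git's options for non-standard layouts):

svn2git http://svn.example.com/svnrepo --trunk frontend/trunk \
        --branches frontend/branches --tags frontend/tags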

For the conversion it is not necessary to carry all the tags and branches over from the svn repository to git. What is important, however, is that we keep the full history of all commits to all files in the branches/ng directory, going back to the point where it branched off trunk, including all commits that happened in trunk before that. And we want all of these commits, in the layout shown above, in a single git repository. Is this even possible? And how would we do it?

I have already searched with Google and on Stack Overflow but could not find an exact solution to our problem.


Solution 1:

One solution would be to convert each svn project to its own git repository, using svn2git or plain git svn (a nice little tool already built into git), and then wire them together with git filter-branch.

  1. Clone each svn project individually.
  2. In the repository you want to be the root, add the other repositories as remotes and fetch the branches you want to merge into it (you'll get warnings because the branches share no common history; that is expected). A sketch of steps 1 and 2 follows this list.
  3. Run git filter-branch on those fetched branches, using an index filter that moves their contents into a new subdirectory.
  4. Merge the filtered branches into master (or whatever branch you want) on the root repository; full history is preserved. A sketch of this merge follows the filter command below.
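
For steps 1 and 2 the commands might look roughly like this (URLs, directory and branch names are placeholders):

# Step 1: clone each svn project on its own. git svn maps svn branches
# to remote refs, so turn branches/ng into a local branch that can be
# fetched later (the exact remote ref name differs between git-svn
# versions, e.g. refs/remotes/origin/ng with a prefix).
git svn clone http://svn.example.com/svnrepo/backend --stdlayout backend-git
cd backend-git && git branch ng refs/remotes/ng && cd ..

git svn clone http://svn.example.com/svnrepo/frontend --stdlayout frontend-git

# Step 2: make frontend-git the root repository, check out its ng
# branch, then wire in backend-git and fetch its ng branch under a
# new name. Warnings about missing common history are expected.
cd frontend-git
git checkout -b ng refs/remotes/ng
git remote add backend ../backend-git
git fetch backend ng:backend-ng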

The command for step 3 would look something like this:

git filter-branch --index-filter '
    git ls-files -s |
    perl -pe "s{\t\"?}{$&newsubdir/}" |
    GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info &&
    mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"
' HEAD
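
Step 4 is then an ordinary merge in the root repository. A sketch, reusing the names from the earlier sketch (the filter above runs on HEAD, so check out backend-ng before running it; git 2.9 and later need --allow-unrelated-histories to merge unrelated histories, older versions merge them without the flag):

git checkout backend-ng        # run the filter-branch command above here
git checkout ng                # back to the branch that becomes the main line
git merge --allow-unrelated-histories backend-ng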

The magic, and every time I have to do this it does feel a little like magic, is the perl statement in the index filter. git filter-branch rewrites the index at each commit, prepending 'newsubdir/' to every blob path (i.e. to every file path recorded in the commit). You might have to experiment a little to get the paths exactly right. A couple of lessons learned from someone who's walked this path before:

  • Back everything up. git filter-branch rewrites history destructively; once you have rewritten it, you cannot easily get it back. Be sure to back up every repository copy you're using (see the sketch after this list). Nothing's worse than finishing a complex operation and discovering you missed a / in a path.
  • Script everything. Unless you've got some serious skill, you won't get this right the first time. Script each individual step as you complete it, so that rerunning any of them is easy. And if you discover a week later that you screwed up a flag, you can reproduce the whole run in moments.
  • Spend $20 on a cluster compute instance in EC2. git filter-branch is enormously CPU intensive. An index filter on a deep history could take hours to run on your local machine, but a fraction of that time on an AWS cluster compute instance. Sure, they cost a little more than $2 an hour, but you'll only need one for a few hours. Save yourself the pain and run those scripts you wrote on hardware that makes the operation trivial. It costs the price of a nice lunch.
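
For the first point, a mirror clone is a cheap snapshot to fall back on. A sketch, using the example directory names from the steps above:

# Full snapshot before any rewriting; delete it once you're satisfied.
git clone --mirror frontend-git frontend-git-backup.git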