Git with large files

Solution 1:

Update 2017:

Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet"
(i.e. the Windows code base, which is approximately 3.5M files and, when checked into a Git repo, results in a repo of about 300GB, and produces 1,760 daily “lab builds” across 440 branches, in addition to thousands of pull-request validation builds)

GVFS virtualizes the file system beneath your git repo so that git and all tools see what appears to be a normal repo, but GVFS only downloads objects as they are needed.

Some parts of GVFS might be contributed upstream (to Git itself).
But in the meantime, all new Windows development is now (August 2017) on Git.


Update April 2015: GitHub proposes: Announcing Git Large File Storage (LFS)

Using git-lfs (see git-lfs.github.com) and a server supporting it, such as lfs-test-server, you can store only the metadata in the git repo, and the large files elsewhere. Maximum of 2 GB per file.

(animated demo: https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif)

See git-lfs/wiki/Tutorial:

git lfs track '*.bin'
git add .gitattributes "*.bin"
git commit -m "Track .bin files"

Original answer:

Regarding git's limitations with large files, you can consider bup (presented in detail in GitMinutes #24).

The design of bup highlights the three issues that limit a git repo:

  • huge files (the xdelta for packfiles is held in memory only, which doesn't work well with large files);
  • huge numbers of files, which means one file per blob, and a slow git gc that generates one packfile at a time;
  • huge packfiles, with a packfile index that is inefficient for retrieving data from the (huge) packfile.

Handling huge files and xdelta

The primary reason git can't handle huge files is that it runs them through xdelta, which generally means it tries to load the entire contents of a file into memory at once.
If it didn't do this, it would have to store the entire contents of every single revision of every single file, even if you only changed a few bytes of that file.
That would be a terribly inefficient use of disk space, and git is well known for its amazingly efficient repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow and memory-hungry for large files.
For git's main purpose, i.e. managing your source code, this isn't a problem.

What bup does instead of xdelta is what we call "hashsplitting."
We wanted a general-purpose way to efficiently back up any large file that might change in small ways, without storing the entire file every time. We read through the file one byte at a time, calculating a rolling checksum of the last 128 bytes.

rollsum seems to do pretty well at its job. You can find it in bupsplit.c.
Basically, it converts the last 128 bytes read into a 32-bit integer. What we then do is take the lowest 13 bits of the rollsum, and if they're all 1's, we consider that to be the end of a chunk.
This happens on average once every 2^13 = 8192 bytes, so the average chunk size is 8192 bytes.
We're dividing up those files into chunks based on the rolling checksum.
Then we store each chunk separately (indexed by its sha1sum) as a git blob.
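
As an illustration only, here is a minimal Python sketch of that chunking loop. It is not bup's actual rollsum from bupsplit.c; it just uses a naive rolling sum over a 128-byte window with the same "lowest 13 bits all 1s" boundary test:

WINDOW = 128          # rolling-checksum window described above
BOUNDARY_BITS = 13    # lowest 13 bits all 1s => end of chunk (avg ~8192 bytes)
MASK = (1 << BOUNDARY_BITS) - 1

def hashsplit(data: bytes):
    """Yield chunks of data; boundaries depend only on local content."""
    window = bytearray()
    rollsum = 0
    start = 0
    for i, byte in enumerate(data):
        window.append(byte)
        rollsum += byte
        if len(window) > WINDOW:
            rollsum -= window.pop(0)      # drop the byte leaving the window
        if (rollsum & MASK) == MASK:      # lowest 13 bits are all 1s
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]                # final partial chunk

# e.g. chunks = list(hashsplit(some_bytes)); b"".join(chunks) == some_bytes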

With hashsplitting, no matter how much data you add, modify, or remove in the middle of the file, all the chunks before and after the affected chunk are absolutely the same.
All that matters to the hashsplitting algorithm is the 32-byte "separator" sequence, and a single change can only affect, at most, one separator sequence or the bytes between two separator sequences.
Like magic, the hashsplit chunking algorithm will chunk your file the same way every time, even without knowing how it had chunked it previously.

The next problem is less obvious: after you store your series of chunks as git blobs, how do you store their sequence? Each blob has a 20-byte sha1 identifier, which means the simple list of blobs is going to be 20/8192 = 0.25% of the file length.
For a 200GB file, that's 488 megs of just sequence data.

We extend the hashsplit algorithm a little further using what we call "fanout." Instead of checking just the last 13 bits of the checksum, we use additional checksum bits to produce additional splits.
What you end up with is an actual tree of blobs - which git 'tree' objects are ideal to represent.
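
To make the fanout idea concrete, here is a hedged sketch of one possible rule for turning extra checksum bits into tree levels; the 4-bits-per-level grouping is an assumption for illustration, not necessarily the parameters bup uses:

BOUNDARY_BITS = 13
BITS_PER_LEVEL = 4   # assumed grouping; higher levels are rarer split points

def fanout_level(rollsum: int) -> int:
    """0 for an ordinary chunk boundary, 1+ for higher-level split points."""
    rollsum >>= BOUNDARY_BITS     # skip the 13 bits that made the boundary
    extra = 0
    while rollsum & 1:            # count further consecutive 1 bits
        extra += 1
        rollsum >>= 1
    return extra // BITS_PER_LEVEL

Roughly speaking, a higher-level split point closes off the groups of chunks below it, so the flat chunk list becomes a tree whose interior nodes map naturally onto git 'tree' objects.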

Handling huge numbers of files and git gc

git is designed for handling reasonably-sized repositories that change relatively infrequently. You might think you change your source code "frequently" and that git handles much more frequent changes than, say, svn can handle.
But that's not the same kind of "frequently" we're talking about.

The #1 killer is the way it adds new objects to the repository: it creates one file per blob. Then you later run 'git gc', which combines those files into a single file (using highly efficient xdelta compression, and ignoring any files that are no longer relevant).

'git gc' is slow, but for source code repositories, the resulting super-efficient storage (and associated really fast access to the stored files) is worth it.

bup doesn't do that. It just writes packfiles directly.
Luckily, these packfiles are still git-formatted, so git can happily access them once they're written.

Handling huge repositories (meaning huge numbers of huge packfiles)

Git isn't actually designed to handle super-huge repositories.
Most git repositories are small enough that it's reasonable to merge them all into a single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git is designed to expect the total size of all packs to be larger than available memory, and once it can handle that, it can handle virtually any amount of data about equally efficiently.
The problem is the packfile index (.idx) files.

Each packfile (*.pack) in git has an associated idx (*.idx) that's a sorted list of git object hashes and file offsets.
If you're looking for a particular object based on its sha1, you open the idx, binary search it to find the right hash, then take the associated file offset, seek to that offset in the packfile, and read the object contents.

The performance of the binary search is about O(log n) in the number of hashes in the pack, with an optimized first step (you can read about it elsewhere) that improves it somewhat to O(log(n)-7).
Unfortunately, this breaks down a bit when you have lots of packs.
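
As a rough model of that lookup (not a parser of the real *.idx format), each search is just a binary search over a sorted list of hashes:

import bisect

# Idealized pack index: sorted object hashes plus their offsets in the *.pack.

def find_offset(sorted_hashes, offsets, sha1_hex):
    """Binary-search the index for sha1_hex; return its offset in the pack."""
    i = bisect.bisect_left(sorted_hashes, sha1_hex)   # O(log n)
    if i < len(sorted_hashes) and sorted_hashes[i] == sha1_hex:
        return offsets[i]        # seek to this offset in the packfile
    return None                  # object is not in this pack

# With many packs, a lookup may have to repeat this search in every pack's idx.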

To improve performance of this sort of operation, bup introduces midx (pronounced "midix" and short for "multi-idx") files.
As the name implies, they index multiple packs at a time.
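
Conceptually, a midx behaves like the merge of all the per-pack sorted indexes into one sorted list, so a lookup needs a single binary search instead of one search per pack. A toy sketch of that idea (the real midx file format is different):

import heapq

def build_midx(per_pack_indexes):
    """per_pack_indexes: one sorted [(sha1_hex, offset), ...] list per pack."""
    tagged = (
        ((sha1, pack_id, offset) for sha1, offset in entries)
        for pack_id, entries in enumerate(per_pack_indexes)
    )
    return list(heapq.merge(*tagged))   # one list, still sorted by sha1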

Solution 2:

You really, really, really do not want large binary files checked into your Git repository.

Each update you add will cumulatively increase the overall size of your repository, meaning that down the road your Git repo will take longer and longer to clone and use up more and more disk space. Git stores the entire history of the branch locally, so when someone checks out the branch, they don't just have to download the latest version of the database; they also have to download every previous version.

If you need to provide large binary files, upload them to some server separately, and then check in a text file with a URL where the developer can download the large binary file. FTP is actually one of the better options, since it's specifically designed for transferring binary files, though HTTP is probably even more straightforward.
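
As a sketch of that approach, the checked-in text file could contain nothing but the URL, with a small helper script to fetch the binary on demand. The file names and paths here are made up for illustration:

import urllib.request

# Hypothetical layout: "assets/big-dataset.url" is a tiny text file in the repo
# whose only content is the download URL of the real binary.
POINTER_FILE = "assets/big-dataset.url"   # made-up path, for illustration
TARGET = "assets/big-dataset.bin"         # where the downloaded binary goes

def fetch_large_file():
    with open(POINTER_FILE) as f:
        url = f.read().strip()            # the URL stored in the repo
    urllib.request.urlretrieve(url, TARGET)
    print("downloaded", url, "->", TARGET)

if __name__ == "__main__":
    fetch_large_file()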