Managing large binary files with Git
I am looking for opinions on how to handle large binary files on which my source code (web application) is dependent. We are currently discussing several alternatives:
- Copy the binary files by hand.
- Pro: Not sure.
- Contra: I am strongly against this, as it increases the likelihood of errors when setting up a new site or migrating the old one, and it adds yet another manual hurdle.
- Manage them all with Git.
- Pro: Removes the possibility of 'forgetting' to copy an important file.
- Contra: Bloats the repository, reduces flexibility in managing the code base, and checkouts, clones, etc. take quite a while.
- Separate repositories.
- Pro: Checking out/cloning the source code stays as fast as ever, and the images are properly archived in their own repository.
- Contra: Removes the simplicity of having a single Git repository for the project, and it surely introduces other issues I haven't thought of.
What are your experiences/thoughts regarding this?
Also: Does anybody have experience with multiple Git repositories and managing them in one project?
The files are images for a program that generates PDFs containing those files. The files will not change very often (as in years), but they are essential: the program will not work without them.
I discovered git-annex recently, which I find awesome. It was designed for managing large files efficiently, and I use it for my photo/music (etc.) collections. The development of git-annex is very active. The content of the files can be kept out of the Git repository; only the tree hierarchy is tracked by Git (through symlinks). However, to get the content of a file, a second step is necessary after pulling/pushing, e.g.:
$ git annex add mybigfile
$ git commit -m'add mybigfile'
$ git push myremote
$ git annex copy --to myremote mybigfile ## This command copies the actual content to myremote
$ git annex drop mybigfile ## Remove content from local repo
...
$ git annex get mybigfile ## Retrieve the content
## or to specify the remote from which to get:
$ git annex copy --from myremote mybigfile
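Note that in a brand-new repository the annex has to be initialized before the first git annex add; a minimal sketch (the description string and the remote URL below are placeholders of mine):
$ git init myrepo && cd myrepo
$ git annex init "my laptop" ## Enable git-annex in this repository
$ git remote add myremote ssh://example.com/path/to/repo.git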
There are many commands available, and there is great documentation on the website. A package is available in Debian.
If the program won't work without the files, splitting them into a separate repo seems like a bad idea. We have large test suites that we break out into a separate repo, but those are truly "auxiliary" files.
However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. So, you'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.
Here's a good introduction to submodules from the Git Book.
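As a rough sketch, assuming the images live in a hypothetical repository at https://example.com/project-images.git and should land under assets/images:
$ git submodule add https://example.com/project-images.git assets/images
$ git commit -m 'Add images as a submodule'
## A fresh clone then needs the submodule to be fetched as well:
$ git clone --recurse-submodules https://example.com/webapp.git
## or, in an already existing clone:
$ git submodule update --init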
Another solution, since April 2015, is Git Large File Storage (LFS) (by GitHub).
It uses git-lfs (see git-lfs.github.com) and has been tested with a server supporting it: lfs-test-server.
You store only the metadata in the Git repo, and the large files elsewhere.
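A minimal sketch of the workflow, assuming the large files are PNG images (the pattern and the file name are my own examples):
$ git lfs install               ## set up the LFS Git hooks (once per machine)
$ git lfs track "*.png"         ## writes a rule to .gitattributes
$ git add .gitattributes logo.png
$ git commit -m 'Track PNG images with Git LFS'
$ git push                      ## pointers go into Git history, content goes to the LFS server
## A self-hosted server such as lfs-test-server can be selected with:
$ git config lfs.url <server-url>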
Have a look at git bup, which is a Git extension for smartly storing large binaries in a Git repository.
You'd want to have it as a submodule, but you won't have to worry about the repository getting hard to handle. One of their sample use cases is storing VM images in Git.
I haven't actually seen better compression rates, but my repositories don't have really large binaries in them.
Your mileage may vary.
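For reference, a rough sketch of basic bup usage (the directory names are my own examples; wiring the resulting bup repository in as a submodule is a separate step):
$ export BUP_DIR=./images.bup   ## bup keeps its data in a Git repository at this path
$ bup init
$ bup index images/             ## scan the files to back up
$ bup save -n images images/    ## store them under the save name 'images'
$ bup ls /images/latest         ## inspect what was saved; 'bup restore' gets it back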