How to perform better document version control on Excel files and SQL schema files

The answer I have written here can be applied in this case. A tool called xls2txt can provide human-readable output from .xls files. So in short, you should put this to your .gitattributes file:

*.xls diff=xls

And in the .git/config:

[diff "xls"]
    binary = true
    textconv = /path/to/xls2txt

Of course, I'm sure you can find similar tools for other file types as well, making git diff a very useful tool for office documents. This is what I currently have in my global .gitconfig:

[diff "xls"]
    binary = true
    textconv = /usr/bin/py_xls2txt
[diff "pdf"]
    binary = true
    textconv = /usr/bin/pdf2txt
[diff "doc"]
    binary = true
    textconv = /usr/bin/catdoc
[diff "docx"]
    binary = true
    textconv = /usr/bin/docx2txt

The Pro Git book has a good chapter on the subject: 8.2 Customizing Git - Git Attributes

Since you've tagged your question with git I assume you are asking about Git usage for this.

Well, SQL dumps are normal text files so it makes perfect sense to track them with Git. Just create a repository and store them in it. When you get a new version of a file, simply overwrite it and commit, Git will figure out everything for you, and you'll be able to see modification dates, checkout specific versions of this file and compare different versions.

The same is true for .xlsx if you decompress them. .xlsx files are zipped up directories of XML files (See How to properly assemble a valid xlsx file from its internal sub-components?). Git will view them as binary unless decompressed. It is possible to unzip the .xlsx and track the changes to the individual XML files inside of the archive.

You could also do this with .xls files, but the problem here is that .xls format is binary, so you can't get meaningful diffs from it. But you'll still be able to see modification history and checkout specific versions.

I've been struggling with this exact problem for the last few days and have written a small .NET utility to extract and normalise Excel files in such a way that they're much easier to store in source control. I've published the executable here:

https://bitbucket.org/htilabs/ooxmlunpack/downloads/OoXmlUnpack.exe

..and the source here:

https://bitbucket.org/htilabs/ooxmlunpack

If there's any interest I'm happy to make this more configurable, but at the moment, you should put the executable in a folder (e.g. the root of your source repository) and when you run it, it will:

Scan the folder and its subfolders for any .xlsx and .xlsm files
Take a copy of the file as *.orig.
Unzip each file and re-zip it with no compression.
Pretty-print any files in the archive which are valid XML.
Delete the calcchain.xml file from the archive (since it changes a lot and doesn't affect the content of the file).
Inline any unformatted text values (otherwise these are kept in a lookup table which causes big changes in the internal XML if even a single cell is modified).
Delete the values from any cells which contain formulas (since they can just be calculated when the sheet is next opened).
Create a subfolder *.extracted, containing the extracted zip archive contents.

Clearly not all of these things are necessary, but the end result is a spreadsheet file that will still open in Excel, but which is much more amenable to diffing and incremental compression. Also, storing the extracted files as well makes it much more obvious in the version history what changes have been applied in each version.

If there's any appetite out there, I'm happy to make the tool more configurable since I guess not everyone will want the contents extracted, or possibly the values removed from formula cells, but these are both very useful to me at the moment.

In tests, a 2 MB spreadsheet 'unpacks' to 21 MB, but then I was able to store five versions of it with small changes between each, in a 1.9 MB Mercurial data file, and visualise the differences between versions effectively using Beyond Compare in text mode.

NB: although I'm using Mercurial, I read this question while researching my solution and there's nothing Mercurial-specific about the solution, should work fine for Git or any other VCS.

Tante recommended a very simple approach in Managing ZIP-based file formats in Git:

Open your ~/.gitconfig file (create if not existing already) and add the following stanza:
[diff "zip"]
textconv = unzip -c -a

Use the open document extension .fods. It's a plain, uncompressed XML markup format that both Excel and LibreOffice can open, and the diffs will look good.

How to perform better document version control on Excel files and SQL schema files

Related

Recent Posts