Can I make git recognize a UTF-16 file as text?

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.

Can git be taught to recognize that this file is text and handle it appropriately?

I'm using git under Cygwin, with core.autocrlf set to false. I could use msysGit or git under UNIX, if necessary.


I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:

$ git config --global diff.tool vimdiff      # or merge.tool to get merging too!
$ git difftool commit1 commit2

git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.

Find "difftool" too long to type? No problem:

$ git config --global alias.dt difftool
$ git dt commit1 commit2

Git rocks.


There is a very simple solution that works out of the box on Unices.

For example, with Apple's .strings files just:

  1. Create a .gitattributes file in the root of your repository with:

     *.strings diff=localizablestrings
    
  2. Add the following to your ~/.gitconfig file:

     [diff "localizablestrings"]
     textconv = "iconv -f utf-16 -t utf-8"
    

Source: Diff .strings files in Git (and older post from 2010).
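To try this without touching ~/.gitconfig, the same driver can be configured per repository. The sketch below assumes iconv is installed and works in a throwaway repository; the file name demo.strings, its contents, and the identity values are made up for illustration.

```shell
#!/bin/sh
# Sketch only: throwaway repository with hypothetical file names and contents.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Route *.strings files through the "localizablestrings" diff driver.
echo '*.strings diff=localizablestrings' > .gitattributes
# Repository-local equivalent of the ~/.gitconfig stanza.
git config diff.localizablestrings.textconv "iconv -f utf-16 -t utf-8"

# Commit a UTF-16 file, then change it in the working tree.
printf 'hello\n' | iconv -f utf-8 -t utf-16 > demo.strings
git add . && git commit -q -m "first"
printf 'hello\nworld\n' | iconv -f utf-8 -t utf-16 > demo.strings

# git diff now compares the UTF-8 conversions of both versions.
diff_output=$(git diff -- demo.strings)
printf '%s\n' "$diff_output"
```

The textconv output is only used for display; the committed blob is still the original UTF-16 file.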


Have you tried setting your .gitattributes to treat it as a text file?

e.g.:

*.vmc diff

More details at http://www.git-scm.com/docs/gitattributes.html.
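One way to see what the attribute changes is to compare git diff --numstat before and after adding it. This is a sketch in a throwaway repository, assuming iconv is installed; machine.vmc and its contents are hypothetical.

```shell
#!/bin/sh
# Sketch only: throwaway repository; machine.vmc and its contents are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Commit a UTF-16 file, then change it in the working tree.
printf 'memory=512\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add machine.vmc && git commit -q -m "add VM config"
printf 'memory=1024\n' | iconv -f utf-8 -t utf-16 > machine.vmc

# Without the attribute, git treats the file as binary and --numstat
# prints dashes instead of line counts.
before=$(git diff --numstat -- machine.vmc)

# With the diff attribute set, git counts added and removed lines.
echo '*.vmc diff' > .gitattributes
after=$(git diff --numstat -- machine.vmc)

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

The diff itself still consists of raw UTF-16 bytes, so whether it is readable depends on what your terminal or pager does with them.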


By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).

But looking at the .gitattributes manpage, here is how the binary macro attribute is defined:

[attr]binary -diff -crlf

So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):

[attr]utf16 diff merge -crlf

From there you would be able to specify in any .gitattributes file something like:

*.vmc utf16

Also note that you should still be able to diff a file, even if git thinks it's binary, with:

git diff --text

Edit

This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.

But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:

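#!/bin/bash
# As an external diff, git passes: path old-file old-hex old-mode new-file new-hex new-mode
diff <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
exit 0  # diff exits 1 when the files differ, which git would treat as a failed helper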

Note that converting to UTF-8 might work for merging as well, you just have to make sure it's done in both directions.
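For reference, here is one way such a helper could be wired in through the GIT_EXTERNAL_DIFF environment variable. This is a sketch, not a tested recipe for every setup: it assumes iconv and a POSIX sh, uses temporary files instead of bash process substitution, and the helper name utf16diff.sh, the repository, and the file contents are all hypothetical. Git calls an external diff program with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode), so the two files to compare are "$2" and "$5".

```shell
#!/bin/sh
# Sketch only: repository, file names, and the helper name are hypothetical.
work=$(mktemp -d)
helper="$work/utf16diff.sh"

# Helper called by git as:
#   helper path old-file old-hex old-mode new-file new-hex new-mode
cat > "$helper" <<'EOF'
#!/bin/sh
old=$(mktemp) && new=$(mktemp) || exit 1
iconv -f utf-16 -t utf-8 "$2" > "$old"
iconv -f utf-16 -t utf-8 "$5" > "$new"
diff -u "$old" "$new"
rm -f "$old" "$new"
exit 0   # diff exits 1 on differences; report success to git regardless
EOF
chmod +x "$helper"

mkdir "$work/repo" && cd "$work/repo" || exit 1
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Commit a UTF-16 file, then change it in the working tree.
printf 'memory=512\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add machine.vmc && git commit -q -m "add VM config"
printf 'memory=1024\n' | iconv -f utf-8 -t utf-16 > machine.vmc

ext_output=$(GIT_EXTERNAL_DIFF="$helper" git diff --ext-diff -- machine.vmc)
printf '%s\n' "$ext_output"
```

Setting the variable on a single command keeps the helper from affecting diffs of other files; a permanent setup would more likely use diff.external or a per-path diff driver.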

As for the output to the terminal when looking at a diff of a UTF-16 file:

Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.

GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).


Git has recently begun to understand encodings such as UTF-16. See the gitattributes docs and search for working-tree-encoding.

[Make sure your man page matches since this is quite new!]

If (say) the file is UTF-16 without a BOM on a Windows machine, then add this to your .gitattributes file:

*.vmc text working-tree-encoding=UTF-16LE eol=crlf

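If UTF-16 (with BOM) on *nix, make it the following (the gitattributes docs use plain UTF-16 for files with a BOM, or UTF-16LE-BOM when the file must stay little-endian with a BOM):

*.vmc text working-tree-encoding=UTF-16 eol=lf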

(Replace *.vmc with *.whatever for whatever file types you need to handle.)

See: Support working-tree-encoding "UTF-16LE-BOM".
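A quick way to check that the attribute is taking effect is to look at what git actually stores. The sketch below assumes a git new enough to support working-tree-encoding, plus iconv; the throwaway repository, machine.vmc, and its contents are hypothetical, and eol is written in the lowercase spelling the gitattributes docs use for its values.

```shell
#!/bin/sh
# Sketch only: throwaway repository; machine.vmc and its contents are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

echo '*.vmc text working-tree-encoding=UTF-16LE eol=crlf' > .gitattributes

# Working-tree file: UTF-16LE, no BOM, CRLF line endings.
printf 'memory=512\r\n' | iconv -f utf-8 -t utf-16le > machine.vmc
git add .gitattributes machine.vmc && git commit -q -m "add VM config"

# The blob stored in the repository is UTF-8 with LF line endings.
stored=$(git cat-file -p HEAD:machine.vmc)
printf 'stored blob: %s\n' "$stored"

# Change the working-tree file; git diff compares the converted content.
printf 'memory=1024\r\n' | iconv -f utf-8 -t utf-16le > machine.vmc
diff_output=$(git diff -- machine.vmc)
printf '%s\n' "$diff_output"
```

Because the conversion happens when content is added and checked out, files committed before the attribute existed stay in their original encoding inside the repository until they are added again.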


Added later

Following @Hackslash, one may find that this is insufficient

 *.vmc text working-tree... 

To get nice text-diffs you need

 *.vmc diff working-tree...

Putting both works as well

 *.vmc text diff working-tree... 

But it's arguably

  • Redundant — eol=... implies text
  • Verbose — a large project could easily have dozens of different text file types

The Problem

Git has a built-in macro attribute, binary, which means -diff -merge -text. The opposite, +text +diff, is not available built-in, but git gives the tools (I think!) for synthesizing it.

The solution

Git allows one to define new macro attributes.

I'd propose that at the top of the .gitattributes file you have

 [attr]textfile text diff

Then for all paths that need to be text and diff do

 path textfile working-tree-encoding= eol=...

Note that in most cases we would want the default encoding (utf-8) and the default eol (native), so both settings may be dropped.

Most lines should look like

*.c textfile
*.py textfile
Etc
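Since the macro proposal is untested here, git check-attr offers a way to see how it expands. A minimal sketch in a throwaway repository; the paths main.c and machine.vmc are hypothetical and need not exist for check-attr to report on them.

```shell
#!/bin/sh
# Sketch only: throwaway repository; the queried paths are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Macro definition first, then the paths that use it.
cat > .gitattributes <<'EOF'
[attr]textfile text diff
*.c textfile
*.vmc textfile working-tree-encoding=UTF-16LE eol=crlf
EOF

# Ask git which attributes each path ends up with.
attrs_c=$(git check-attr text diff -- main.c)
attrs_vmc=$(git check-attr text diff working-tree-encoding eol -- machine.vmc)
printf '%s\n%s\n' "$attrs_c" "$attrs_vmc"
```

Note that git only honors [attr] macro definitions in the top-level .gitattributes file, in $GIT_DIR/info/attributes, or in the global and system attributes files, not in .gitattributes files in subdirectories.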

Why not just use diff?

Practical: In most cases we want the native eol, which means no eol=... setting. Then text is not implied and has to be set explicitly.

Conceptual: Text vs. binary is the fundamental distinction. eol, encoding, diff, etc. are just some aspects of it.

Disclaimer

Due to the bizarre times we are living in I don't have a machine with a current working git. So I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.