Can I make git recognize a UTF-16 file as text?
I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.
Can git be taught to recognize that this file is text and handle it appropriately?
I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.
I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:
$ git config --global diff.tool vimdiff # or merge.tool to get merging too!
$ git difftool commit1 commit2
git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.
Find "difftool" too long to type? No problem:
$ git config --global alias.dt difftool
$ git dt commit1 commit2
Git rocks.
There is a very simple solution that works out of the box on Unices.
For example, with Apple's .strings files:
- Create a .gitattributes file in the root of your repository with:
  *.strings diff=localizablestrings
- Add the following to your ~/.gitconfig file:
  [diff "localizablestrings"]
      textconv = "iconv -f utf-16 -t utf-8"
Source: Diff .strings files in Git (and older post from 2010).
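To see the textconv approach end to end, here is a minimal sketch in a throwaway repository. It assumes git and iconv are on your PATH; the file name Demo.strings and the demo identity are made up, and the driver is configured repo-locally rather than in ~/.gitconfig so nothing global is touched.

```shell
#!/bin/sh
# Sketch: diff a UTF-16 file as text through a textconv filter.
# Assumes git and iconv are installed; repo and file names are invented.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Route *.strings files through the "localizablestrings" diff driver.
echo '*.strings diff=localizablestrings' > .gitattributes
# Repo-local config here; the answer above uses ~/.gitconfig instead.
git config diff.localizablestrings.textconv 'iconv -f utf-16 -t utf-8'

# Commit a UTF-16 file (iconv's generic UTF-16 output starts with a BOM).
printf '"greeting" = "hello";\n' | iconv -f utf-8 -t utf-16 > Demo.strings
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'add strings'

# Change the file; git converts both sides to UTF-8 before diffing.
printf '"greeting" = "hello";\n"farewell" = "bye";\n' | iconv -f utf-8 -t utf-16 > Demo.strings
diff_output=$(git diff -- Demo.strings)
printf '%s\n' "$diff_output"
```

Keep in mind that textconv only affects what is shown to a human; the stored blob is still UTF-16, and the resulting patch is not meant to be applied.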
Have you tried setting your .gitattributes to treat it as a text file?
e.g.:
*.vmc diff
More details at http://www.git-scm.com/docs/gitattributes.html.
By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to treat it as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).
But looking at the .gitattributes manpage, here is how the binary macro attribute is defined:
[attr]binary -diff -crlf
So it seems to me that you could define a custom attribute in your top-level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):
[attr]utf16 diff merge -crlf
From there you would be able to specify in any .gitattributes file something like:
*.vmc utf16
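If you want to confirm how git resolves such a macro before relying on it, git check-attr will tell you. This is only a sketch in a scratch repository; machine.vmc is a made-up path and does not even need to exist, since check-attr only matches patterns.

```shell
#!/bin/sh
# Sketch: define the utf16 macro and ask git how it resolves for a path.
# Assumes git is installed; the repository is a throwaway temp directory.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Macro definitions are only honored in the top-level .gitattributes
# (or in $GIT_DIR/info/attributes or the file named by core.attributesFile).
cat > .gitattributes <<'EOF'
[attr]utf16 diff merge -crlf
*.vmc utf16
EOF

# Query the three attributes the macro is supposed to expand to.
attrs=$(git check-attr diff merge crlf -- machine.vmc)
printf '%s\n' "$attrs"
```

Each output line has the form "path: attribute: value", so diff and merge should come back as set and crlf as unset.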
Also note that you should still be able to diff a file, even if git thinks it's binary, with:
git diff --text
Edit
This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.
But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:
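#!/bin/bash
# git calls an external diff with: path old-file old-hex old-mode new-file new-hex new-mode
diff <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
[ $? -le 1 ]  # diff exits 1 when the files differ; git expects 0 from the helper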
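For what it's worth, here is roughly how such a helper gets wired in. This is a sketch under assumptions: bash, git, diff and iconv are available, and utf16diff.sh and machine.vmc are invented names. Git hands an external diff program seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode), so the helper below reads the two file names from positions 2 and 5.

```shell
#!/bin/bash
# Sketch: run a UTF-16-aware external diff through GIT_EXTERNAL_DIFF.
# Assumes bash, git, diff and iconv; all file names are invented.
set -e
repo=$(mktemp -d)
tools=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper script. Git passes seven arguments:
#   path old-file old-hex old-mode new-file new-hex new-mode
cat > "$tools/utf16diff.sh" <<'EOF'
#!/bin/bash
diff -u <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
# diff exits 1 when the inputs differ; git expects 0 from the helper.
[ $? -le 1 ]
EOF
chmod +x "$tools/utf16diff.sh"

# Commit a UTF-16 file, then modify it in the working tree.
printf 'name=old\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add machine.vmc
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'add vmc'
printf 'name=newer\n' | iconv -f utf-8 -t utf-16 > machine.vmc

# The environment variable applies to this one invocation only.
ext_output=$(GIT_EXTERNAL_DIFF="$tools/utf16diff.sh" git diff --ext-diff -- machine.vmc)
printf '%s\n' "$ext_output"
```

Setting diff.external in the config would make this permanent; the environment variable is handy for trying it out first.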
Note that converting to UTF-8 might work for merging as well; you just have to make sure it's done in both directions.
As for the output to the terminal when looking at a diff of a UTF-16 file:
Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.
GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).
Git has recently begun to understand encodings such as UTF-16 (the working-tree-encoding attribute arrived in Git 2.18).
See the gitattributes docs and search for working-tree-encoding.
[Make sure your man page matches, since this is quite new!]
If (say) the file is UTF-16 without a BOM on a Windows machine, then add to your .gitattributes file:
*.vmc text working-tree-encoding=UTF-16LE eol=crlf
If it is UTF-16 with a BOM (little-endian) on *nix, make it:
*.vmc text working-tree-encoding=UTF-16LE-BOM eol=lf
(Replace *.vmc with *.whatever for whatever type of files you need to handle.)
See: Support working-tree-encoding "UTF-16LE-BOM".
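A quick way to convince yourself this works is to check what actually lands in the index. The sketch below assumes Git 2.18 or later built with iconv support; machine.vmc is a made-up file, and the generic UTF-16 encoding name (which git expects to carry a BOM) is my choice for the demo, so adjust it to your file's real encoding.

```shell
#!/bin/sh
# Sketch: keep UTF-16 in the working tree while git stores UTF-8.
# Assumes Git 2.18+ with iconv support; file names are invented.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# "UTF-16" here means UTF-16 with a BOM; this is a demo assumption.
echo '*.vmc text working-tree-encoding=UTF-16 eol=lf' > .gitattributes

printf 'name=old\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add .gitattributes machine.vmc

# The staged blob is UTF-8, so git's own diff sees plain text.
staged=$(git cat-file -p :machine.vmc)
printf '%s\n' "$staged"

# Edit the UTF-16 working copy and diff it against the index.
printf 'name=old\nmemory=512\n' | iconv -f utf-8 -t utf-16 > machine.vmc
wt_diff=$(git diff -- machine.vmc)
printf '%s\n' "$wt_diff"
```

Unlike textconv, this changes what is stored: the repository holds UTF-8 and the conversion back to UTF-16 happens on checkout.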
Added later
Following @Hackslash, one may find that this is insufficient:
*.vmc text working-tree...
To get nice text diffs you need:
*.vmc diff working-tree...
Putting both works as well:
*.vmc text diff working-tree...
But it's arguably:
- Redundant: eol=... implies text
- Verbose: a large project could easily have dozens of different text file types
The Problem
Git has a macro attribute binary, which means -text -diff. The opposite, +text +diff, is not available built in, but git gives the tools (I think!) for synthesizing it.
The solution
Git allows one to define new macro attributes.
I'd propose that at the top of the .gitattributes file you have:
[attr]textfile text diff
Then for all paths that need to be text and diff do
path textfile working-tree-encoding= eol=...
Note that in most cases we would want the default encoding (UTF-8) and the default eol (native), so both settings may be dropped.
Most lines should look like
*.c textfile
*.py textfile
Etc
Why not just use diff?
Practical: In most cases we want native eol, which means no eol=... setting. So text won't get implied and needs to be put explicitly.
Conceptual: Text vs. binary is the fundamental distinction; eol, encoding, diff, etc. are just some aspects of it.
Disclaimer
Due to the bizarre times we are living in I don't have a machine with a current working git. So I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.