Problem with SVN filename encoding on Mac OS X

I have a file whose name contains Unicode characters. All filenames on Mac OS X are UTF-8 encoded, and $LANG is set to en_US.UTF-8.

However, it seems svn has some problems with that:

az@ip212 1054 (Integration) %ls
Abbildungen                           Verbesserungsvorschläge_Applets.odt
AllgemeineAnmerkungen.rtf             Verbesserungsvorschläge_Applets.rtf
Geogebra                              Vorlagen
Texte
az@ip212 1055 (Integration) %svn ls
Abbildungen/
AllgemeineAnmerkungen.rtf
Geogebra/
Texte/
Verbesserungsvorschläge_Applets.rtf
Verbesserungsvorschläge_Applets.odt
Vorlagen/
az@ip212 1056 (Integration) %svn del Verb*.odt
svn: Use --force to override this restriction
svn: 'Verbesserungsvorschläge_Applets.odt' is not under version control
az@ip212 1057 (Integration) %svn status
?       Verbesserungsvorschläge_Applets.odt
!       Verbesserungsvorschläge_Applets.odt
az@ip212 1058 (Integration) %

As you can see, svn del does not recognize the filename, and even svn status gets confused by it.

How can I fix this? I also tried setting LC_CTYPE=$LANG LC_ALL=$LANG LC=$LANG, but that changed nothing.


Solution 1:

I got an answer on the Subversion mailing list from B Smith-Mannschott:

This is a known issue.

http://subversion.tigris.org/issues/show_bug.cgi?id=2464

One poster on the comment thread for that issue suggested the following:

Additional comments from Julian Mehnle Thu Aug 6 07:40:30 -0700 2009:

There is a work-around: install the "unicode_path" variant of the subversion MacPorts package:

$ sudo port install subversion +unicode_path

I haven't tried this myself.

// ben

It mostly works for me, but I am not sure what else is broken now.

I investigated the Subversion source, and UTF-8 filename support seems to be badly broken. The code ignores the fact that a filename can have several different representations in UTF-8 and treats each representation as a different filename. Mac OS X may change the representation internally, and this is what confuses Subversion.

You can see in their source that their path compare function is basically just a memcmp.
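
To illustrate why a byte-wise path comparison fails here, the following is a minimal Python sketch. It is not Subversion's code; the function names `same_path_bytewise` and `same_path_normalized` are made up for this illustration. It compares the composed and decomposed spellings of one of the file names above, first byte by byte and then after Unicode normalization.

```python
import unicodedata


def same_path_bytewise(a: str, b: str) -> bool:
    """Compare the raw UTF-8 bytes, the way a memcmp-style check would."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_path_normalized(a: str, b: str) -> bool:
    """Compare after bringing both names into the same normalization form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Same visible name, two Unicode representations of the "ä".
composed = "Verbesserungsvorschl\u00e4ge_Applets.odt"     # precomposed U+00E4
decomposed = "Verbesserungsvorschla\u0308ge_Applets.odt"  # "a" + combining U+0308

print(same_path_bytewise(composed, decomposed))    # the two byte sequences differ
print(same_path_normalized(composed, decomposed))  # the names match once normalized
```

A comparison of the second kind is presumably what a proper fix would need, although where exactly to normalize inside Subversion is a separate question.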

I tried to fix it, but I am not sure whether I succeeded. It seems to work now, and I don't want to spend much more time on it.

Read the upstream bug report for more details and a follow-up discussion.

Solution 2:

As others have mentioned here and elsewhere, the root cause is as follows: Unicode allows some characters to be represented in different ways (composed vs. decomposed), and each form yields a different UTF-8 byte sequence. The file systems on macOS (HFS+ or APFS) encode filenames in the normalized decomposed form (NFD), while Subversion seems to use a different form when files are added.

So when a file named ä_¥_é_ç_Ø.txt is added from the command line:

> svn add ä_¥_é_ç_Ø.txt
A       ä_¥_é_ç_Ø.txt

Subversion stores the file name in a different form, which leads to problems:

> svn status
?       ä_¥_é_ç_Ø.txt
!       ä_¥_é_ç_Ø.txt

The first line is about the existing file (whose name is NFD-encoded). This file exists in the file system but is unknown to Subversion ("?").
The second line is about the added file (whose name is encoded differently). This file is known to Subversion but does not exist in the file system ("!").
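
Such "?"/"!" pairs can also be found mechanically. The following Python sketch is only an illustration: the helper name `find_normalization_conflicts` is hypothetical, and it assumes that the path starts after an eight-character prefix in each `svn status` line, as it does in the output shown here. It reports unversioned and missing entries whose names become identical after normalization.

```python
import unicodedata
from collections import defaultdict


def find_normalization_conflicts(status_output: str):
    """Return (unversioned, missing) path pairs that are equal after NFC normalization."""
    by_key = defaultdict(dict)
    for line in status_output.splitlines():
        # Only "?" (unversioned) and "!" (missing) entries are of interest.
        if len(line) < 9 or line[0] not in "?!":
            continue
        path = line[8:]  # assumption: status columns occupy the first 8 characters
        by_key[unicodedata.normalize("NFC", path)][line[0]] = path
    return [(v["?"], v["!"]) for v in by_key.values() if "?" in v and "!" in v]


# Sample input modelled on the status output above (escapes make the forms visible).
sample = "\n".join([
    "?" + " " * 7 + "a\u0308_\u00a5_e\u0301_c\u0327_\u00d8.txt",  # decomposed name
    "!" + " " * 7 + "\u00e4_\u00a5_\u00e9_\u00e7_\u00d8.txt",      # composed name
    "M" + " " * 7 + "unrelated.txt",
])

pairs = find_normalization_conflicts(sample)
for unversioned, missing in pairs:
    print(ascii(unversioned), "<->", ascii(missing))
```

In practice the input would come from running `svn status` in the working copy instead of the hard-coded sample.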

To see the different encodings, use xxd:

> svn status | head -1 | xxd; echo; svn status | tail -1 | xxd
00000000: 3f20 2020 2020 2020 61cc 885f c2a5 5f65  ?       a.._.._e
00000010: cc81 5f63 cca7 5fc3 982e 7478 740a       .._c.._...txt.

00000000: 2120 2020 2020 2020 c3a4 5fc2 a55f c3a9  !       .._.._..
00000010: 5fc3 a75f c398 2e74 7874 0a              _.._...txt.
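
The two byte sequences correspond to the Unicode normalization forms NFD and NFC. As a cross-check, this short Python sketch (using only the standard `unicodedata` module) encodes the file name in both forms. The hex strings it prints should match the file name bytes in the two dumps above, without the status prefix and the trailing newline.

```python
import unicodedata

# The file name from the example, written with escapes so the form is unambiguous.
name = "\u00e4_\u00a5_\u00e9_\u00e7_\u00d8.txt"  # ä_¥_é_ç_Ø.txt

# NFD: decomposed form, as reported by the file system ("?" line).
nfd_hex = unicodedata.normalize("NFD", name).encode("utf-8").hex()
# NFC: composed form, as shown for the added entry ("!" line).
nfc_hex = unicodedata.normalize("NFC", name).encode("utf-8").hex()

print("NFD:", nfd_hex)
print("NFC:", nfc_hex)
```

Note that ¥ and Ø have no decomposition, so their bytes (c2a5 and c398) are the same in both forms.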

Here is how I deal with this to make Subversion work with UTF-8 encoded file names on macOS file systems:

When adding or removing files from Subversion, I do not type or autocomplete the file names in the Subversion command. Instead, I run ls, copy the file name, and paste it into the Subversion command, where the decomposed characters show up as their actual code points (for example a<0308>).
Doing this causes Subversion to use the file name exactly as it is encoded on disk instead of a converted form.
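
The copy-and-paste step could also be scripted. The following Python sketch is an illustration under assumptions, not a tested tool: the helper `svn_command_for` is hypothetical, and it relies on the behavior described above, namely that Subversion records a name in the form it is given. It takes the names from `os.listdir`, so they are in whatever form the file system reports, and matches them against a pattern after normalizing both sides.

```python
import fnmatch
import os
import tempfile
import unicodedata


def svn_command_for(directory: str, action: str, pattern: str) -> list:
    """Build an svn argument list from file names exactly as the file system reports them."""
    wanted = unicodedata.normalize("NFC", pattern)
    names = [
        name
        for name in sorted(os.listdir(directory))
        # Normalize only for matching; the on-disk form is what gets passed on.
        if fnmatch.fnmatchcase(unicodedata.normalize("NFC", name), wanted)
    ]
    return ["svn", action] + names


# Demonstration in a scratch directory; no svn command is actually run here.
with tempfile.TemporaryDirectory() as tmp:
    on_disk = "a\u0308_test.txt"  # decomposed "ä"
    open(os.path.join(tmp, on_disk), "w").close()
    command = svn_command_for(tmp, "add", "\u00e4_*.txt")  # pattern typed in composed form

print([ascii(part) for part in command])
```

The resulting list could be handed to `subprocess.run(command, cwd=directory)`, since the names are relative to the listed directory. The manual procedure looks like this.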

Example:

> svn status
?       ä_¥_é_ç_Ø.txt
> ls
ä_¥_é_ç_Ø.txt

Copy the file name and paste it into the following command:

> svn add a<0308>_¥_e<0301>_c<0327>_Ø.txt
A         ä_¥_é_ç_Ø.txt
> svn commit -m "Test"
Adding         ä_¥_é_ç_Ø.txt
Transmitting file data .done
Committing transaction...
Committed revision 4.
> svn status
>