Fixing a file consisting of both UTF-8 and Windows-1252

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252 aka Windows-1252. Is there a way of recovering the original text?

Solution 1:

Yes!

Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two approaches, depending on whether a single line can mix encodings.

A line can contain a mix of encodings

Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

Heuristics are employed, but they are fairly reliable. Only the following cases will fail:

One of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
encoded using iso-8859-1 or cp1252, followed by one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
One of
[àáâãäåæçèéêëìíîï]
encoded using iso-8859-1 or cp1252, followed by two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
One of
[ðñòóôõö÷]
encoded using iso-8859-1 or cp1252, followed by three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
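To make the first failure case concrete, here is a minimal sketch using only the core Encode module. The sample text "Ã©" is my own choice for illustration: encoded as cp1252 it is the byte pair C3 A9, which is also the well-formed UTF-8 encoding of "é", so any heuristic that tries UTF-8 first will read it as U+00E9 rather than U+00C3 U+00A9.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# "Ã©" (U+00C3 U+00A9) encoded as cp1252 is the byte pair C3 A9.
my $bytes = encode("cp1252", "\x{C3}\x{A9}");

# The same two bytes are also well-formed UTF-8 for "é" (U+00E9),
# so the bytes alone cannot tell us which reading was intended.
my $as_utf8   = decode("UTF-8",  $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
my $as_cp1252 = decode("cp1252", $bytes);

printf("bytes:     %v02X\n",   $bytes);
printf("as UTF-8:  U+%v04X\n", $as_utf8);
printf("as cp1252: U+%v04X\n", $as_cp1252);
```

Sequences like this are rare in real text, which is why the heuristic holds up well in practice.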

The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

Each line only uses one encoding

fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you can make the process even more reliable by checking whether the whole line is valid UTF-8 and falling back to cp1252 only when it isn't.

$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A

Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

The line is encoded using iso-8859-1 or cp1252,
At least one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
is present in the line,
All instances of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
are always followed by exactly one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
All instances of
[àáâãäåæçèéêëìíîï]
are always followed by exactly two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
All instances of
[ðñòóôõö÷]
are always followed by exactly three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
None of
[øùúûüýþÿ]
are present in the line, and
None of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
are present in the line except where previously mentioned.

Notes:

Encoding::FixLatin installs command line tool fix_latin to convert files, and it would be trivial to write one using the second approach.
fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
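As a rough sketch of the converter mentioned in the notes above, the per-line approach can be wrapped in a small routine that reads from one handle and writes clean UTF-8 to another. The names decode_line and fix_handle are hypothetical helpers of my own, and the fallback encoding is a parameter so that another single-byte encoding known to Encode can be substituted for cp1252. The demo uses in-memory handles so it is self-contained.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: decode a line as UTF-8 if it is well-formed,
# otherwise decode it using $fallback (e.g. "cp1252").
sub decode_line {
    my ($bytes, $fallback) = @_;
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined($text) ? $text : decode($fallback, $bytes);
}

# Hypothetical helper: read lines from $in, write them to $out as UTF-8.
sub fix_handle {
    my ($in, $out, $fallback) = @_;
    while (defined(my $line = <$in>)) {
        print {$out} encode("UTF-8", decode_line($line, $fallback));
    }
}

# Demo: line 1 is "café" in cp1252, line 2 is "café" in UTF-8.
my $input  = "caf\xE9\n" . "caf\xC3\xA9\n";
my $output = '';
open(my $in,  '<:raw', \$input)  or die "Can't open input: $!";
open(my $out, '>:raw', \$output) or die "Can't open output: $!";
fix_handle($in, $out, "cp1252");
close($out);

print $output eq ("caf\xC3\xA9\n" x 2)
    ? "both lines are now UTF-8\n"
    : "unexpected output\n";
```

To use this as a filter, one could call binmode on STDIN and STDOUT and then call fix_handle(\*STDIN, \*STDOUT, "cp1252") instead of running the demo.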

Solution 2:

This is one of the reasons I wrote Unicode::UTF8. With Unicode::UTF8 this is trivial using the fallback option in Unicode::UTF8::decode_utf8().

use Unicode::UTF8 qw[decode_utf8];
use Encode        qw[decode];

print "UTF-8 mixed with Latin-1 (ISO-8859-1):\n";
for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
    no warnings 'utf8';
    printf "U+%v04X\n", decode_utf8($octets, sub { $_[0] });
}

print "\nUTF-8 mixed with CP-1252 (Windows-1252):\n";
for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
    no warnings 'utf8';
    printf "U+%v04X\n", decode_utf8($octets, sub { decode('CP-1252', $_[0]) });
}

Output:

UTF-8 mixed with Latin-1 (ISO-8859-1):
U+00D0.0020.0092.0020.0412.000A
U+0412.000A

UTF-8 mixed with CP-1252 (Windows-1252):
U+00D0.0020.2019.0020.0412.000A
U+0412.000A

Unicode::UTF8 is written in C/XS and only invokes the fallback callback when it encounters an ill-formed UTF-8 sequence.
