Correctly reading a UTF-16 text file into a string without external libraries?

Solution 1:

The C++11 solution (supported on your platform by Visual Studio since 2010, as far as I know) would be:

#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
    // open as a byte stream
    std::wifstream fin("text.txt", std::ios::binary);
    // apply BOM-sensitive UTF-16 facet
    fin.imbue(std::locale(fin.getloc(),
       new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
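    // read one wide character at a time and print its numeric value in hex
    // (the cast is needed because streaming a wchar_t to a narrow stream
    // is not allowed as of C++20)
    for (wchar_t c; fin.get(c); )
        std::cout << std::showbase << std::hex << static_cast<unsigned long>(c) << '\n';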
}

Solution 2:

When you open a UTF-16 file, you must open it in binary mode. In text mode on Windows, certain bytes are interpreted specially: a 0x0d 0x0a pair is collapsed to a single 0x0a, and 0x1a marks the end of the file. Some UTF-16 characters have one of those bytes as half of their code unit, and they will corrupt or cut short the read. This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
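To see how ordinary characters can trip over the special bytes mentioned above, here is a minimal sketch. The helper name utf16_contains_byte is made up for illustration; it only checks whether a given byte value would show up in the file for some code unit, in either byte order.

```cpp
#include <string>

// Returns true if any UTF-16 code unit in `text` has a high or low byte equal
// to `byte`, i.e. that byte value would appear in the file in either byte order.
// (Hypothetical helper, not part of any library.)
bool utf16_contains_byte(const std::u16string& text, unsigned char byte)
{
    for (char16_t unit : text) {
        unsigned char low  = static_cast<unsigned char>(unit & 0xFF);
        unsigned char high = static_cast<unsigned char>((unit >> 8) & 0xFF);
        if (low == byte || high == byte)
            return true;
    }
    return false;
}
```

For example, U+011A (Ě) is stored as the bytes 0x1a 0x01 in little-endian UTF-16, so utf16_contains_byte(u"\u011A", 0x1A) is true, and a text-mode read on Windows could stop at that character.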

Solution 3:

Edit:

So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.



The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.


You'll have a bit of difficulty converting UTF-16 to the locale-specific wchar_t encoding in a completely portable fashion.

Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):

  1. codecvt<char32_t,char,mbstate_t>
  2. codecvt<char16_t,char,mbstate_t>
  3. codecvt_utf8
  4. codecvt_utf16
  5. codecvt_utf8_utf16
  6. c32rtomb/mbrtoc32
  7. c16rtomb/mbrtoc16

And what each one does:

  1. A codecvt facet that always converts between UTF-8 and UTF-32
  2. converts between UTF-8 and UTF-16
  3. converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
  4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
  5. converts between UTF-8 and UTF-16
  6. If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
  7. If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16

If __STDC_ISO_10646__ is defined then converting directly using codecvt_utf16<wchar_t> should be fine, since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
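As a rough sketch, the macro check can be wrapped in a compile-time helper. The function names below are invented for this example, and the 21-bit threshold is simply the number of bits needed for code points up to U+10FFFF.

```cpp
#include <climits>
#include <cwchar>

// True when the implementation defines __STDC_ISO_10646__, i.e. it promises
// that wchar_t values are Unicode (ISO/IEC 10646) code points in every locale.
// (Hypothetical helper name.)
constexpr bool wchar_is_unicode()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// True when wchar_t has enough bits for any code point up to U+10FFFF.
constexpr bool wchar_holds_any_code_point()
{
    return sizeof(wchar_t) * CHAR_BIT >= 21;
}
```

A program could branch on wchar_is_unicode() to choose between the direct codecvt_utf16<wchar_t> route and a platform-specific fallback.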

Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course, no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.
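The first hop in that chain, UTF-16 to UCS-4, is lossless and can be written by hand without any facet. The sketch below is only an illustration: utf16_to_utf32 is a made-up name, and replacing unpaired surrogates with U+FFFD is an assumed policy, not something the standard mandates.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into UTF-32 code points.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::u32string utf16_to_utf32(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                char32_t low = in[++i];                    // consume the low surrogate
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            } else {
                out.push_back(0xFFFD);                     // missing low surrogate
            }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {     // stray low surrogate
            out.push_back(0xFFFD);
        } else {
            out.push_back(unit);                           // BMP code point, copied as-is
        }
    }
    return out;
}
```

The lossy part is everything after this step, where the result has to fit into the locale's multi-byte or wchar_t encoding.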


So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the _O_U16TEXT mode on files.

This should build and run anywhere, but makes a bunch of assumptions to actually work:

#include <cstring>   // std::memcpy
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main ()
{
    std::stringstream ss;
    std::ifstream fin("filename", std::ios::binary); // binary mode: no newline or Ctrl-Z translation
    ss << fin.rdbuf(); // dump file contents into a stringstream
    std::string const &s = ss.str();
    if (s.size()%sizeof(wchar_t) != 0)
    {
        std::cerr << "file not the right size\n"; // must be even, two bytes per code unit
        return 1;
    }
    std::wstring ws;
    ws.resize(s.size()/sizeof(wchar_t));
    std::memcpy(&ws[0],s.c_str(),s.size()); // copy data into wstring
}

You should probably at least add code to handle endianness and the 'BOM'. Also, Windows newlines don't get converted automatically in binary mode, so you need to do that manually.
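A minimal sketch of those three steps (BOM, byte order, newlines) might look like the following. It is not tied to wchar_t: it produces a std::u16string so it behaves the same on every platform. The name decode_utf16_bytes is hypothetical, and defaulting to little-endian when there is no BOM and ignoring a trailing odd byte are assumptions chosen for this example.

```cpp
#include <cstddef>
#include <string>

// Decode raw file bytes as UTF-16 code units, honoring a leading BOM and
// collapsing CR LF into LF.
// Assumptions: little-endian when no BOM is present; a trailing odd byte is ignored.
std::u16string decode_utf16_bytes(const std::string& bytes)
{
    bool big_endian = false;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; pos = 2; }  // UTF-16BE BOM
        else if (b0 == 0xFF && b1 == 0xFE) { pos = 2; }                // UTF-16LE BOM
    }

    std::u16string out;
    out.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        unsigned char first  = static_cast<unsigned char>(bytes[pos]);
        unsigned char second = static_cast<unsigned char>(bytes[pos + 1]);
        char16_t unit = static_cast<char16_t>(
            big_endian ? (first << 8) | second : (second << 8) | first);
        if (unit == u'\n' && !out.empty() && out.back() == u'\r')
            out.back() = u'\n';   // turn the CR already stored into the LF
        else
            out.push_back(unit);
    }
    return out;
}
```

The input can be the std::string obtained from a stringstream filled via rdbuf() on a stream opened with std::ios::binary.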