Convert between std::u8string and std::string
C++20 added char8_t and std::u8string for UTF-8. However, there is no UTF-8 version of std::cout, and OS APIs mostly expect char and the execution character set. So we still need a way to convert between UTF-8 and the execution character set.
I was rereading a char8_t paper, and it looks like the only way to convert between UTF-8 and the execution character set (ECS) is to use the std::c8rtomb and std::mbrtoc8 functions. However, their API is extremely confusing. Can someone provide example code?
UTF-8 "support" in C++20 seems to be a bad joke.
The only UTF functionality in the Standard Library is support for strings and string_views (std::u8string, std::u8string_view, std::u16string, ...). That is all. There is no Standard Library support for UTF coding in regular expressions, formatting, file i/o and so on.
In C++17 you can at least treat any UTF-8 data as char data, which makes it possible to use std::regex, std::fstream, std::cout, etc. without loss of performance.
In C++20 things change. You can no longer write, for example, std::string text = u8"...";
It will be impossible to write something like
std::u8fstream file; std::u8string line; ... file << line;
since there is no std::u8fstream.
Even the new C++20 std::format does not support UTF at all, because all necessary overloads are simply missing. You cannot write
std::u8string text = std::format(u8"...{}...", 42);
To make matters worse, there is no simple cast (or conversion) between std::string and std::u8string (or even between const char* and const char8_t*). So if you want to format (using std::format) or input/output (std::cin, std::cout, std::fstream, ...) UTF-8 data, you have to internally copy all strings. That will be an unnecessary performance killer.
Finally, what use will UTF have without input, output, and formatting?
At present, std::c8rtomb and std::mbrtoc8 are the only interfaces provided by the standard that enable conversion between the execution encoding and UTF-8. The interfaces are awkward. They were designed to match pre-existing interfaces like std::c16rtomb and std::mbrtoc16. The wording added to the C++ standard for these new interfaces intentionally matches the wording in the C standard for the pre-existing related functions (hopefully these new functions will eventually be added to C; I still need to pursue that). The intent in matching the C standard wording, as confusing as it is, is to ensure that anyone familiar with the C wording recognizes that the char8_t interfaces work the same way.
cppreference.com has some examples for the UTF-16 versions of these functions that should be useful for understanding the char8_t variants.
- https://en.cppreference.com/w/cpp/string/multibyte/mbrtoc16
- https://en.cppreference.com/w/cpp/string/multibyte/c16rtomb
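For orientation, here is a rough sketch of the call pattern, written against the UTF-16 functions because they are more widely available in current C libraries than the char8_t ones. The function names narrow_to_utf16 and utf16_to_narrow are made up for this example. The result depends on the current C locale (LC_CTYPE), and the first function stops at the first null character in the input. The char8_t variants are meant to follow the same pattern, but this sketch does not exercise them.

```cpp
#include <climits>    // MB_LEN_MAX
#include <cstddef>    // std::size_t
#include <cuchar>     // std::mbrtoc16, std::c16rtomb
#include <cwchar>     // std::mbstate_t
#include <stdexcept>
#include <string>

// Hypothetical helper: execution (locale) encoding -> UTF-16.
std::u16string narrow_to_utf16(const std::string& in) {
    std::u16string out;
    std::mbstate_t state{};
    const char* p = in.c_str();
    const char* end = p + in.size() + 1;  // include the null terminator
    for (;;) {
        char16_t c16;
        std::size_t rc = std::mbrtoc16(&c16, p, static_cast<std::size_t>(end - p), &state);
        if (rc == 0) break;  // a null character was converted: stop
        if (rc == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        if (rc == static_cast<std::size_t>(-2))
            throw std::runtime_error("incomplete multibyte sequence");
        out.push_back(c16);
        // (size_t)-3 means c16 is the second half of a surrogate pair and
        // no input was consumed; otherwise rc is the number of bytes read.
        if (rc != static_cast<std::size_t>(-3)) p += rc;
    }
    return out;
}

// Hypothetical helper: UTF-16 -> execution (locale) encoding.
std::string utf16_to_narrow(const std::u16string& in) {
    std::string out;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    for (char16_t c : in) {
        std::size_t rc = std::c16rtomb(buf, c, &state);
        if (rc == static_cast<std::size_t>(-1))
            throw std::runtime_error("character not representable in this locale");
        out.append(buf, rc);  // rc is 0 after the first half of a surrogate pair
    }
    return out;
}
```

In a real program you would normally call std::setlocale(LC_ALL, "") (or select a specific locale) first; in the default "C" locale only ASCII is guaranteed to convert.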
The common answer given by C++ authorities at the yearly CppCon conference (for example in 2018 and 2019) was that you should pick your own UTF-8 library for this. There are all kinds of flavours; just pick the one you like. There is still embarrassingly little understanding of and support for Unicode on the C++ side.
Some people hope there will be something in C++23, but we don't even have an official working group so far.