std::string in a multi-threaded program
Given that:
1) The C++03 standard does not address the existence of threads in any way
2) The C++03 standard leaves it up to implementations to decide whether std::string should use Copy-on-Write semantics in its copy-constructor
3) Copy-on-Write semantics often lead to unpredictable behavior in a multi-threaded program
I come to the following, seemingly controversial, conclusion:
You simply cannot safely and portably use std::string in a multi-threaded program
Obviously, no STL data structure is thread-safe. But at least, with std::vector for example, you can simply use mutexes to protect access to the vector. With an std::string implementation that uses COW, you can't even reliably do that without editing the reference counting semantics deep within the vendor implementation.
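For comparison, here is a minimal sketch of the "protect the vector with a mutex" approach mentioned above. It assumes C++11's std::mutex and std::thread purely for brevity (under C++03 you would use pthreads or a vendor library instead), and LockedIntVector and fill_concurrently are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: every access to the vector goes through one mutex.
class LockedIntVector {
public:
    void push_back(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.push_back(value);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return data_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so size() can lock while const
    std::vector<int> data_;
};

// Starts `threads` workers that each append `per_thread` values to one
// shared vector, waits for all of them, and returns the final element count.
std::size_t fill_concurrently(std::size_t threads, int per_thread) {
    assert(per_thread >= 0);
    LockedIntVector shared;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&shared, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                shared.push_back(i);
            }
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return shared.size();
}
```

This works for a vector because each vector owns its buffer exclusively, so locking the object also locks its storage. With a COW string, two distinct string objects may share one buffer behind your back, so a per-object mutex does not cover the shared state.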
Real-world example:
In my company, we have a multi-threaded application which has been thoroughly unit-tested and run through Valgrind countless times. The application ran for months with no problems whatsoever. One day, I recompiled the application with another version of gcc, and all of a sudden I started getting random segfaults all the time. Valgrind now reports invalid memory accesses deep within libstdc++, in the std::string copy constructor.
So what is the solution? Well, of course, I could typedef std::vector<char> as a string class - but really, that sucks. I could also wait for C++0x, which I pray will require implementors to forgo COW. Or, (shudder), I could use a custom string class. I personally always rail against developers who implement their own classes when a preexisting library will do fine, but honestly, I need a string class which I can be sure is not using COW semantics; and std::string simply doesn't guarantee that.
Am I right that std::string simply cannot be used reliably at all in portable, multi-threaded programs? And what is a good workaround?
You cannot safely and portably do anything in a multi-threaded program. There is no such thing as a portable multi-threaded C++ program, precisely because threads throw everything C++ says about order of operations, and the results of modifying any variable, out the window.
There's also nothing in the standard to guarantee that vector can be used in the way you say. It would be legal to provide a C++ implementation with a threading extension in which, say, any use of a vector outside the thread in which it was initialized results in undefined behavior. The instant you start a second thread, you aren't using standard C++ any more, and you must look to your compiler vendor for what is safe and what is not.
If your vendor provides a threading extension, and also provides a std::string with COW that (therefore) cannot be made thread-safe, then I think for the time being your argument is with your vendor, or with the threading extension, not with the C++ standard. For example, arguably POSIX should have barred COW strings in programs which use pthreads.
You could possibly make it safe by having a single mutex, which you take while doing any string mutation whatsoever, and any reads of a string that's the result of a copy. But you'd probably get crippling contention on that mutex.
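A rough sketch of that single-mutex idea, again assuming C++11 threading facilities for brevity. The names g_string_mutex and append_in_threads are hypothetical, and whether this is actually sufficient depends on the vendor's string implementation, so treat it as an illustration of the locking discipline, not a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical process-wide lock guarding all string copies and mutations.
std::mutex g_string_mutex;

// Each worker copies a shared string, mutates its copy, and stores the
// result, all while holding the one global mutex. Returns the total length
// of the stored results.
std::size_t append_in_threads(std::size_t threads) {
    const std::string shared = "base";
    std::vector<std::string> results(threads);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&shared, &results, t] {
            std::lock_guard<std::mutex> lock(g_string_mutex);
            std::string local = shared;  // under COW this may share a buffer
            local += "-x";               // mutation, so done under the lock
            results[t] = local;          // read of a copy, also under the lock
            // `local` is destroyed before `lock`, so its cleanup is covered too.
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }

    std::size_t total = 0;
    for (const std::string& result : results) {
        total += result.size();  // all workers have joined; no concurrency here
    }
    return total;
}
```

Note that the workers run entirely serialized on the one mutex, which is the contention problem described above.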
You are right. This will be fixed in C++0x. For now you have to rely on your implementation's documentation. For example, recent libstdc++ versions (GCC) let you use string objects as if no string object shared its buffer with another one. C++0x forces a library implementation to protect the user from "hidden sharing".
Given that the standard doesn't say a word about memory models and is completely thread-unaware, I'd say you can't assume that every implementation will be non-COW, so no, you can't.
Apart from that, if you know your tools, most implementations use non-COW strings to allow multi-threading.
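If you want to check what your own toolchain does, a crude probe is to compare the buffer pointers of a string and its copy. The sketch below also shows a commonly used way to force an unshared copy by constructing from a pointer and a length. Both helper names (copy_shares_buffer, deep_copy) are hypothetical, and the probe only tells you about the one library you ran it against. It is no substitute for the vendor's documentation.

```cpp
#include <cassert>
#include <string>

// Returns true if copy-constructing `original` yields a string that points
// at the same character buffer, which is what a COW implementation does.
// Both strings are const so that only the const data() overload is called.
bool copy_shares_buffer(const std::string& original) {
    const std::string copy(original);
    return copy.data() == original.data();
}

// Builds a new string from the raw characters instead of using the copy
// constructor, so the result gets its own buffer even under COW.
// (Use a non-empty source: empty strings may share a static empty buffer.)
std::string deep_copy(const std::string& source) {
    std::string result(source.data(), source.size());
    assert(result == source);
    return result;
}
```

Calling copy_shares_buffer on a reasonably long string and getting true suggests the library uses COW; in that case deep_copy is one way to hand a thread a string that shares nothing with the original.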