How is std::string implemented?

Solution 1:

Virtually every compiler I've used provides source code for the runtime - so whether you're using GCC or MSVC or whatever, you have the capability to look at the implementation. However, a large part or all of std::string will be implemented as template code, which can make for very difficult reading.

Scott Meyers's book, Effective STL, has a chapter on std::string implementations that's a decent overview of the common variations: "Item 15: Be aware of variations in string implementations".

He talks about 4 variations, which fall into two broad families:

  • several variations on a ref-counted implementation (commonly known as copy on write). When a string object is copied unchanged, the refcount is incremented but the actual string data is not copied. Both objects point to the same refcounted data until one of them modifies it, causing a 'copy on write' of the data. The variations are in where things like the refcount, locks, etc. are stored.

  • a "short string optimization" (SSO) implementation. In this variant, the object contains the usual pointer to data, length, size of the dynamically allocated buffer, etc. But if the string is short enough, it will use that area to hold the string itself instead of dynamically allocating a buffer.
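To make the SSO idea concrete, here is a minimal sketch. It is not taken from any real standard library: the class name SsoString, the HeapRep struct and the 15-character threshold are all assumptions made for illustration, and copying is disabled to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, heavily simplified SSO string. Real libraries differ in
// layout, threshold and in how they tell the two representations apart.
class SsoString {
public:
    static constexpr std::size_t kInlineCapacity = 15;  // assumed threshold

    explicit SsoString(const char* s) : size_(std::strlen(s)) {
        assert(s != nullptr);
        if (is_inline()) {
            // Short enough: store the characters inside the object itself.
            std::memcpy(inline_, s, size_ + 1);
        } else {
            // Too long: fall back to a dynamically allocated buffer.
            heap_.capacity = size_;
            heap_.ptr = new char[size_ + 1];
            std::memcpy(heap_.ptr, s, size_ + 1);
        }
    }

    ~SsoString() {
        if (!is_inline()) delete[] heap_.ptr;
    }

    // Copying is left out to keep the sketch short.
    SsoString(const SsoString&) = delete;
    SsoString& operator=(const SsoString&) = delete;

    bool is_inline() const { return size_ <= kInlineCapacity; }
    std::size_t size() const { return size_; }
    const char* c_str() const { return is_inline() ? inline_ : heap_.ptr; }

private:
    struct HeapRep {
        char* ptr;
        std::size_t capacity;
    };

    std::size_t size_;
    // The same bytes hold either the heap bookkeeping or the short string.
    union {
        HeapRep heap_;
        char inline_[kInlineCapacity + 1];
    };
};
```

With this sketch, SsoString("hi") never touches the heap, while a string longer than 15 characters allocates once in the constructor.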

Also, Herb Sutter's "More Exceptional C++" has an appendix (Appendix A: "Optimizations that aren't (in a Multithreaded World)") that discusses why copy on write refcounted implementations often have performance problems in multithreaded applications due to synchronization issues. That article is also available online (but I'm not sure if it's exactly the same as what's in the book):

  • http://www.gotw.ca/publications/optimizations.htm

Both those chapters would be worthwhile reading.
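For readers who want to see the copy-on-write scheme in code, here is a rough sketch. The names CowString, Rep, set_char and shares_buffer_with are made up for this example, and a std::vector<char> stands in for the raw character buffer. The atomic counter hints at the synchronization cost discussed above: every copy and every destruction touches it, even if the strings are never shared across threads. The sketch itself is not meant to be fully thread-safe.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical copy-on-write string; not the layout of any real library.
class CowString {
public:
    explicit CowString(const char* s)
        : rep_(new Rep(std::vector<char>(s, s + std::strlen(s) + 1))) {}

    // Copying shares the buffer and only bumps the (atomic) refcount.
    CowString(const CowString& other) : rep_(other.rep_) {
        rep_->refs.fetch_add(1);
    }

    CowString& operator=(const CowString&) = delete;  // omitted for brevity

    ~CowString() { release(); }

    const char* c_str() const { return rep_->text.data(); }
    std::size_t size() const { return rep_->text.size() - 1; }
    bool shares_buffer_with(const CowString& other) const {
        return rep_ == other.rep_;
    }

    // Any write first makes sure this object owns its data exclusively.
    void set_char(std::size_t i, char c) {
        assert(i < size());
        if (rep_->refs.load() > 1) {
            Rep* own = new Rep(rep_->text);  // the 'copy on write'
            release();
            rep_ = own;
        }
        rep_->text[i] = c;
    }

private:
    struct Rep {
        explicit Rep(const std::vector<char>& t) : refs(1), text(t) {}
        std::atomic<int> refs;   // how many CowString objects share this Rep
        std::vector<char> text;  // characters plus the terminating '\0'
    };

    void release() {
        if (rep_->refs.fetch_sub(1) == 1) delete rep_;
    }

    Rep* rep_;
};
```

Copying a CowString is cheap until one of the copies calls set_char, at which point that copy gets its own buffer and the other one is left untouched.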

Solution 2:

std::string is a class that wraps around some kind of internal buffer and provides methods for manipulating that buffer.

A string in C is just an array of characters terminated by a null character ('\0').

Explaining all the nuances of how std::string works here would take too long. Maybe have a look at the gcc source code http://gcc.gnu.org to see exactly how they do it.

Solution 3:

There's an example implementation in an answer on this page.

In addition, you can look at gcc's implementation, assuming you have gcc installed. If not, you can get the code from their source repository. std::string is a typedef for std::basic_string<char>, so start with basic_string.

Another possible source of info is Watcom's compiler.

Solution 4:

The C++ approach to strings is quite different from the C one. The first and most important difference is that C uses null-terminated (ASCIIZ) strings, while std::string and std::wstring keep track of where the string ends, for example with two iterators (pointers) or with a pointer plus a length; the exact layout depends on the implementation. The string classes manage a dynamically allocated buffer for you, so at the cost of some CPU overhead for dynamic memory handling, string handling becomes more comfortable.

As you probably already know, C doesn't have a built-in generic string type; it only provides a handful of string operations through its standard library. One of the major differences between C and C++ is that C++ wraps this functionality in a class, so std::string can be treated almost as if it were a built-in type.

In C you need to walk through the string to find its length, whereas the std::string::size() member function is basically a single operation (end - begin, or a read of a stored length). You can safely append one string to another as long as you have memory: appending creates a bigger buffer if one is needed, so there is no need to worry about buffer overflow bugs (and therefore exploits) in that operation.
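A small sketch of both points, with made-up helper names (repeat and lengths_agree are not standard functions): appending lets the string grow its own buffer, and size() reports the stored length while strlen() has to scan for the terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: appends `piece` to an initially empty string
// `count` times. The string enlarges its buffer whenever it runs out of
// room, so the caller never sizes a destination array by hand.
std::string repeat(const std::string& piece, std::size_t count) {
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        out += piece;
    }
    assert(out.size() == piece.size() * count);
    return out;
}

// Hypothetical helper: strlen() walks the characters until it finds '\0',
// while size() returns the length the string object already stores.
bool lengths_agree(const std::string& s) {
    return std::strlen(s.c_str()) == s.size();
}
```

Note that lengths_agree only holds for strings without embedded '\0' characters: std::string can contain them, and in that case strlen() stops early while size() still reports the full length.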

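As somebody said here before, the string behaves much like a vector of characters. It is not actually derived from std::vector, though: std::string is a typedef for the class template std::basic_string<char>, and being a template makes it easier to deal with wide and multibyte character systems. You can define your own string type using an expression such as typedef std::basic_string<my_char_t> specific_str_t;, where my_char_t stands for the character type you put in the template parameter (it has to be a character-like type with matching character traits, not a completely arbitrary one).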
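As an illustration of the typedef approach, here is a short sketch using char16_t, for which the standard library already supplies character traits. The names utf16_str_t and make_greeting are invented for this example.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// std::string itself is just basic_string instantiated for char.
static_assert(std::is_same<std::string, std::basic_string<char> >::value,
              "std::string is std::basic_string<char>");

// Hypothetical string type built on a different character type.
typedef std::basic_string<char16_t> utf16_str_t;

// Hypothetical helper showing that the usual string operations are
// available on the new type as well.
utf16_str_t make_greeting() {
    utf16_str_t s = u"hello";
    s += u" world";
    assert(s.size() == 11);
    return s;
}
```

The standard library already provides std::u16string as a typedef for this same type, so in real code you would normally use that name.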

I think there are enough pros and cons on both sides:

C++ string pros:

  • Faster iteration in certain cases: the size is known, and checking whether you are at the end of the string means comparing two pointers rather than reading the character data from memory, which could make a difference with caching.

  • Buffer operations are packaged with the string functionality, so there are fewer worries about buffer problems.

C++ string cons:

  • Because of the dynamic memory allocation, basic usage can have an impact on performance. (Fortunately you can tell the string object up front how big its buffer should be with reserve(); unless you exceed that size, it won't need to allocate more blocks from memory.)

  • Often weird and inconsistent names compared to other languages. This is the bad thing about any STL stuff, but you can get used to it, and it gives the code a somewhat specific C++ feel.

  • The heavy use of templates forces the standard library into header-based solutions, which has a big impact on compile time.
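The remark about setting the buffer size up front refers to std::string::reserve(). Here is a small sketch of the effect; the function name is made up, and the behaviour it checks is what mainstream implementations do in practice rather than something the example proves for every library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical check: reserve room for n characters, append n characters,
// and report whether the buffer stayed where it was the whole time.
bool appends_without_reallocating(std::size_t n) {
    std::string s;
    s.reserve(n);                      // one allocation up front (if needed)
    assert(s.capacity() >= n);

    const char* buffer_before = s.data();
    const std::size_t capacity_before = s.capacity();

    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');              // stays within the reserved capacity
    }

    return s.data() == buffer_before && s.capacity() == capacity_before;
}
```

Without the reserve() call, the same loop would typically reallocate several times as the string grows.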

Solution 5:

That depends on the standard library you use.

STLport, for example, is a C++ Standard Library implementation that provides its own string implementation, among other things.