String concatenation complexity in C++ and Java [duplicate]

Solution 1:

C++ strings are mutable, and pretty much as dynamically sizable as a StringBuffer. Unlike its equivalent in Java, this code wouldn't create a new string each time; it just appends to the current one.

std::string joinWords(std::vector<std::string> const &words) {
    std::string result;
    for (auto &word : words) {
        result += word;
    }
    return result;
}

This runs in linear time if you reserve the size you'll need beforehand. The question is whether looping over the vector to get sizes would be slower than letting the string auto-resize. That, i couldn't tell you. Time it. :)

If you don't want to use std::string itself for some reason (and you should consider it; it's a perfectly respectable class), C++ also has string streams.

#include <sstream>
...

std::string joinWords(std::vector<std::string> const &words) {
    std::ostringstream oss;
    for (auto &word : words) {
        oss << word;
    }
    return oss.str();
}

It's probably not any more efficient than using std::string, but it's a bit more flexible in other cases -- you can stringify just about any primitive type with it, as well as any type that has specified an operator <<(ostream&, its_type&) override.

Solution 2:

This is somewhat tangential to your Question, but relevant nonetheless. (And too big for a comment!!)

On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2).

In Java, the complexity of s1.concat(s2) or s1 + s2 is O(M1 + M2) where M1 and M2 are the respective String lengths. Turning that into the complexity of a sequence of concatenations is difficult in general. However, if you assume N concatenations of Strings of length M, then the complexity is indeed O(M * N * N) which matches what you said in the Question.

Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).

In the StringBuilder case, the amortized complexity of N calls to sb.append(s) for strings of size M is O(M*N). The key word here is amortized. When you append characters to a StringBuilder, the implementation may need to expand its internal array. But the expansion strategy is to double the array's size. And if you do the math, you will see that on average each character in the buffer is going to be copied one extra time during the entire sequence of append calls. So entire sequence still works out as O(M*N) ... and, as it happens M*N is the total string length.

So your end result is correct, but your statement about the complexity of a single call to append is not correct. (I understand what you mean, but the way you say it is facially incorrect.)

Finally, I'd note that in Java you should use StringBuilder rather than StringBuffer unless you need the buffer to be thread-safe.

Solution 3:

As an example of a really simple structure that has O(n) complexity in C++11:

template<typename TChar>
struct StringAppender {
  std::vector<std::basic_string<TChar>> buff;
  StringAppender& operator+=( std::basic_string<TChar> v ) {
    buff.push_back(std::move(v));
    return *this;
  }
  explicit operator std::basic_string<TChar>() {
    std::basic_string<TChar> retval;
    std::size_t total = 0;
    for( auto&& s:buff )
      total+=s.size();
    retval.reserve(total+1);
    for( auto&& s:buff )
      retval += std::move(s);
    return retval;
  }
};

use:

StringAppender<char> append;
append += s1;
append += s2;
std::string s3 = append;

This takes O(n), where n is the number of characters.

Finally, if you know how long all of the strings are, just doing a reserve with enough room makes append or += take a total of O(n) time. But I agree that is awkward.

Use of std::move with the above StringAppender (ie, sa += std::move(s1)) will significantly increase performance for non-short strings (or using it with xvalues etc)

I do not know the complexity of std::ostringstream, but ostringstream is for pretty printing formatted output, or cases where high performance is not important. I mean, they aren't bad, and they may even out perform scripted/interpreted/bytecode languages, but if you are in a rush, you need something else.

As usual, you need to profile, because constant factors are important.

A rvalue-reference-to-this operator+ might also be a good one, but few compilers implement rvalue references to this.