C++11 regex slower than Python
Hi, I would like to understand why the following code, which splits a string using a regex,
#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s)
{
    static const std::regex rsplit(" +");
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();
    auto res = std::vector<std::string>(rit, rend);
    return res;
}

int main()
{
    for(auto i = 0; i < 10000; ++i)
        split("a b c");
    return 0;
}
is slower than the following Python code:
import re

for i in range(10000):
    re.split(' +', 'a b c')
Here are the timings:
> python test.py 0.05s user 0.01s system 94% cpu 0.070 total
./test 0.26s user 0.00s system 99% cpu 0.296 total
I'm using clang++ on OS X.

Compiling with -O3 brings it down to 0.09s user 0.00s system 99% cpu 0.109 total.
Notice: see also this answer: https://stackoverflow.com/a/21708215, which was the basis for EDIT 2 at the bottom of this answer.
I've increased the loop count to 1,000,000 to get a better timing measure.
This is my Python timing:
real 0m2.038s
user 0m2.009s
sys 0m0.024s
Here's an equivalent of your code, just a bit prettier:
#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s, const std::regex &r)
{
    return {
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator()
    };
}

int main()
{
    const std::regex r(" +");
    for(auto i = 0; i < 1000000; ++i)
        split("a b c", r);
    return 0;
}
Timing:
real 0m5.786s
user 0m5.779s
sys 0m0.005s
This is an optimization to avoid repeated construction/allocation of the vector and string objects:
#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
    auto rend = std::sregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i = 0; i < 1000000; ++i)
        split("a b c", r, v);
    return 0;
}
Timing:
real 0m3.034s
user 0m3.029s
sys 0m0.004s
This is nearly a 100% performance improvement.

The vector is created before the loop and can grow its memory during the first iteration. Afterwards there's no memory deallocation by clear(); the vector keeps the memory and constructs the strings in place.
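A small standalone sketch of that behavior (on common implementations, clear() destroys the elements but keeps the allocation):

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> v;
    v.push_back("a");
    v.push_back("b");
    v.push_back("c");
    const auto before = v.capacity();
    v.clear(); // destroys the elements, size() becomes 0...
    std::cout << (v.capacity() == before) << '\n'; // ...capacity kept: prints 1
    return 0;
}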
Another performance increase would be to avoid constructing/destroying std::string objects entirely, and hence the allocation/deallocation of their memory. This is a tentative step in that direction:
#include <regex>
#include <vector>
#include <string>
#include <cstring> // for std::strlen

void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}
Timing:
real 0m2.509s
user 0m2.503s
sys 0m0.004s
An ultimate improvement would be to return a std::vector of const char *, where each char pointer would point to a substring inside the original s C string itself. The problem is that you can't do that, because each of them would not be null terminated (for this, see the usage of C++1y string_ref in a later sample).
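To illustrate the problem, here's a small standalone sketch (not one of the samples above): a pointer into the middle of the original C string has no terminator of its own, so printing it yields the whole tail:

#include <cstdio>

int main()
{
    const char *s = "a b c";
    const char *token = s + 2;  // would-be pointer to the second token, "b"
    std::printf("%s\n", token); // prints "b c", not "b": there's no '\0' after 'b'
    return 0;
}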
That last improvement (matching over the raw characters) can also be achieved while keeping the std::string interface:
#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i = 0; i < 1000000; ++i)
        split("a b c", r, v); // the constant string("a b c") should be optimized
                              // by the compiler. I got the same performance as
                              // if it was an object outside the loop
    return 0;
}
I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.
Boost.Regex

This is the boost::regex timing for the C string arguments sample:

real 0m1.284s
user 0m1.278s
sys 0m0.005s

Same code; the boost::regex and std::regex interfaces in this sample are identical, I just needed to change the namespace and include.
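For reference, a sketch of what that looks like (assuming Boost is installed and the binary is linked with -lboost_regex):

#include <boost/regex.hpp>
#include <vector>
#include <string>
#include <cstring>

void split(const char *s, const boost::regex &r, std::vector<std::string> &v)
{
    auto rit = boost::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = boost::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}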
Best wishes for it to get better over time; C++ stdlib regex implementations are in their infancy.
EDIT
For the sake of completeness, I've tried the above-mentioned "ultimate improvement" suggestion, and it didn't improve the performance of the equivalent std::vector<std::string> &v version at all:
#include <regex>
#include <vector>
#include <string>

template<typename Iterator> class intrusive_substring
{
private:
    Iterator begin_, end_;

public:
    intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}
    Iterator begin() {return begin_;}
    Iterator end() {return end_;}
};

using intrusive_char_substring = intrusive_substring<const char *>;

void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear(); // This can potentially be optimized away by the compiler because
               // the intrusive_char_substring destructor does nothing, so
               // resetting the internal size is the only thing to be done.
               // Formerly allocated memory is maintained.
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->second);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<intrusive_char_substring> v;
    for(auto i = 0; i < 1000000; ++i)
        split("a b c", r, v);
    return 0;
}
This is related to the array_ref and string_ref proposal. Here's some sample code using it:
#include <regex>
#include <vector>
#include <string>
#include <string_ref>

void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->length());
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string_ref> v;
    for(auto i = 0; i < 1000000; ++i)
        split("a b c", r, v);
    return 0;
}
It will also be cheaper to return a vector of string_ref rather than string copies for a split that returns its result by value.
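A minimal sketch of that by-value variant (split_by_value is a hypothetical name; same <string_ref> assumption as in the sample above):

#include <regex>
#include <vector>
#include <string>
#include <string_ref>

std::vector<std::string_ref> split_by_value(const std::string &s, const std::regex &r)
{
    std::vector<std::string_ref> v;
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->length());
        ++rit;
    }
    return v; // moved out; only the cheap string_refs live in the vector
}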
EDIT 2
This new solution is able to get output by return value. I have used Marshall Clow's string_view (string_ref got renamed) libc++ implementation, found at https://github.com/mclow/string_view.
#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>

using namespace std;
using namespace std::experimental;
using namespace boost;

string_view stringfier(const cregex_token_iterator::value_type &match) {
    return {match.first, static_cast<size_t>(match.length())};
}

using string_view_iterator =
    transform_iterator<decltype(&stringfier), cregex_token_iterator>;

iterator_range<string_view_iterator> split(string_view s, const regex &r) {
    return {
        string_view_iterator(
            cregex_token_iterator(s.begin(), s.end(), r, -1),
            stringfier
        ),
        string_view_iterator()
    };
}

int main() {
    const regex r(" +");
    for (size_t i = 0; i < 1000000; ++i) {
        split("a b c", r);
    }
}
Timing:
real 0m0.385s
user 0m0.385s
sys 0m0.000s
Note how much faster this is compared to the previous results. Of course, it's not filling a vector inside the loop (nor, probably, matching anything in advance either), but you get a range anyway, which you can iterate with a range-based for, or even use to fill a vector.
As ranging over the iterator_range creates string_views over the original string (or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.
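For example, here's a sketch of consuming the returned range with a range-based for (reusing the includes, stringfier and split() from the sample above); the std::string is built only for printing:

#include <iostream>
// ...includes, stringfier and split() as in the sample above...

int main() {
    const regex r(" +");
    for (string_view token : split("a b c", r)) {
        // copy the view into a std::string just for stream output
        cout << string(token.begin(), token.end()) << '\n'; // prints a, b, c
    }
}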
Just to compare, using this split implementation but actually filling a vector, we could do this:
int main() {
    const regex r(" +");
    vector<string_view> v;
    v.reserve(10);
    for (size_t i = 0; i < 1000000; ++i) {
        // boost::copy, from <boost/range/algorithm/copy.hpp>;
        // back_inserter needs <iterator>
        copy(split("a b c", r), back_inserter(v));
        v.clear();
    }
}
This uses the Boost range copy algorithm to fill the vector in each iteration; the timing is:
real 0m1.002s
user 0m0.997s
sys 0m0.004s
As can be seen, there isn't much difference compared with the optimized string_view output-parameter version.
Note also that there's a proposal for a std::split that would work like this.
For optimizations, in general, you want to avoid two things:

- burning CPU cycles on unnecessary work
- waiting idly for something to happen (a memory read, a disk read, a network read, ...)

The two can be antithetical: sometimes it ends up being faster to compute something than to cache all of it in memory... so it's a game of balance.
Let's analyze your code:
std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +"); // only computed once

    // search for the first occurrence of rsplit
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();

    // simultaneously:
    // - parses "s" from the second occurrence to past the last occurrence
    // - allocates one `std::string` for each match... at least! (there may be a copy)
    // - allocates space in the `std::vector`, possibly multiple times
    auto res = std::vector<std::string>(rit, rend);

    return res;
}
Can we do better? Well, if we could reuse existing storage instead of continually allocating and deallocating memory, we should see a significant improvement [1]:
// Overwrites 'result' with the matches, returns the number of matches
// (note: 'result' is never shrunk, but may be grown as necessary)
size_t split(std::string const& s, std::vector<std::string>& result){
    static const std::regex rsplit(" +"); // only computed once

    // cregex iterators operate on char pointers, so use data()/size()
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.size(), rsplit, -1);
    auto rend = std::cregex_token_iterator();

    size_t pos = 0;

    // As long as possible, reuse the existing strings (in place)
    for (size_t max = result.size();
         rit != rend && pos != max;
         ++rit, ++pos)
    {
        result[pos].assign(rit->first, rit->second);
    }

    // When there are more matches than existing strings, extend capacity
    for (; rit != rend; ++rit, ++pos) {
        result.emplace_back(rit->first, rit->second);
    }

    return pos;
} // split
In the test that you perform, where the number of submatches is constant across iterations, this version is unlikely to be beaten: it will only allocate memory on the first run (both for rsplit and result) and then keep reusing existing memory.
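A minimal usage sketch of this version (reusing the split() just defined; only the first pos entries of result are meaningful after each call):

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// split() as defined above

int main() {
    std::vector<std::string> result;
    const size_t n = split("a b c", result);
    for (size_t i = 0; i != n; ++i) {
        std::cout << result[i] << '\n'; // prints a, b, c
    }
    // note: result.size() may exceed n after a later call with fewer matches
    return 0;
}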
[1]: Disclaimer: I have only proved this code correct, I have not tested it (as Donald Knuth would say).