Sorting std::strings with numbers in them?

I'm currently sorting by the std::string < operator. The problem with it is that:

30 < 9. The 30 shows up before the 9 since 3 < 9, Windows 9x had this issue to. How could I go about sorting them numerically so that "30 Foxes" shows up after "9 dogs". I should also add that I'm using utf 8 encoding.

Thanks


You can create a custom comparison function to use with std::sort. This function would have to check if the string begins with a numeric value. If it does, convert the numeric part of each string to an int using some mechanism like a stringstream. Then compare the two integer values. If the values compare equally, compare the non-numeric part of the strings lexicographically. Otherwise, if the strings don't contain a numeric part, simply compare the two strings lexicographically as normal.

Basically, something like the following (untested) comparison function:

bool is_not_digit(char c)
{
    return !std::isdigit(c);
}

bool numeric_string_compare(const std::string& s1, const std::string& s2)
{
    // handle empty strings...

    std::string::const_iterator it1 = s1.begin(), it2 = s2.begin();

    if (std::isdigit(s1[0]) && std::isdigit(s2[0])) {
        int n1, n2;
        std::stringstream ss(s1);
        ss >> n1;
        ss.clear();
        ss.str(s2);
        ss >> n2;

        if (n1 != n2) return n1 < n2;

        it1 = std::find_if(s1.begin(), s1.end(), is_not_digit);
        it2 = std::find_if(s2.begin(), s2.end(), is_not_digit);
    }

    return std::lexicographical_compare(it1, s1.end(), it2, s2.end());
}

And then...

std::sort(string_array.begin(), string_array.end(), numeric_string_compare);

EDIT: Of course, this algorithm is only useful if you're sorting strings where the numeric portion appears at the beginning of the string. If you're dealing with strings where the numeric portion can appear anywhere in the string, then you need a more sophisticated algorithm. See http://www.davekoelle.com/alphanum.html for more information.


If you are targeting Windows (XP+) and can afford to convert your strings to utf-16, you can use the StrCmpLogicalW function from Shlwapi. See msdn for details.

Otherwise, ICU provides this functionality in its collators. See UCOL_NUMERIC_COLLATION.


Here's a version that doesn't convert to integer and thus works for long strings of digits regardless of sizeof(int).

#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

int numcmp(const char *a, const char *aend, const char *b, const char *bend)
{
  for (;;) {
    if (a == aend) {
      if (b == bend)
        return 0;
      return -1;
    }
    if (b == bend)
      return 1;
    if (*a == *b) {
      ++a, ++b;
      continue;
    }
    if (!isdigit((unsigned char) *a) || !isdigit((unsigned char) *b))
      return *a - *b;

    // skip leading zeros in both strings
    while (*a == '0' && ++a != aend)
      ;
    while (*b == '0' && ++b != aend)
      ;

    // skip to end of the consecutive digits
    const char *aa = a;
    while (a != aend && isdigit((unsigned char) *a))
      ++a;
    std::ptrdiff_t alen = a - aa;

    const char *bb = b;
    while (b != bend && isdigit((unsigned char) *b))
      ++b;
    std::ptrdiff_t blen = b - bb;

    if (alen != blen)
      return alen - blen;

    // same number of consecutive digits in both strings
    while (aa != a) {
      if (*aa != *bb)
        return *aa - *bb;
      ++aa, ++bb;
    }
  }
}

int numcmp(const std::string& a, const std::string& b)
{
  return numcmp(a.data(), a.data() + a.size(),
                b.data(), b.data() + b.size());
}

int numcmp(const char *a, const char *b)
{
  return numcmp(a, a + strlen(a), b, b + strlen(b));
}