Fast String Hashing Algorithm with low collision rates with 32 bit integer [closed]

Solution 1:

Murmur Hash is pretty nice.

Solution 2:

One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs.

Solution 3:

For a fixed string-set use gperf.

If your string-set changes you have to pick one hash function. That topic has been discussed before:

What's the best hashing algorithm to use on a stl string when using hash_map?

Solution 4:

There's also a nice article at eternallyconfuzzled.com.

Jenkins' One-at-a-Time hash for strings should look something like this:

#include <stdint.h>

uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
        hash += *s;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}

Solution 5:

Another solution that could be even better depending on your use-case is interned strings. This is how symbols work e.g. in Lisp.

An interned string is a string object whose value is the address of the actual string bytes. So you create an interned string object by checking in a global table: if the string is in there, you initialize the interned string to the address of that string. If not, you insert it, and then initialize your interned string.

This means that two interned strings built from the same string will have the same value, which is an address. So if N is the number of interned strings in your system, the characteristics are:

  • Slow construction (needs lookup and possibly memory allocation)
  • Requires global data and synchronization in the case of concurrent threads
  • Compare is O(1), because you're comparing addresses, not actual string bytes (this means sorting works well, but it won't be an alphabetic sort).

Cheers,

Carl