Why is the following code correct for computing the hash of a string?
I am currently reading about the Rabin Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:
hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val ^ p^n ) mod m
Where:
- char_i_val: is the integer value of the character plus 1 given by
string[i]-'a' + 1
- p is a prime number larger than the character set
- m is a large prime number
The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}
return hash_value;
}
I understand what the program is trying to do but I do not understand why it is correct.
My question
I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:
a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m
Based on the above shouldn't the code be as follows?
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
int char_value = (c - 'a' + 1);
hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
p_pow = (p_pow%m * p%m) % m;
}
return hash_value;
}
What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.
Solution 1:
Mathematically, there is no reason to reduce intermediate results modulo m
.
Operationally, there are a couple of very closely related reasons to do it:
- To keep numbers small enough that they can be represented efficiently.
- To keep numbers small enough that operations on them do not overflow.
So let's look at some quantities and see if they need to be reduced.
-
p
was defined as some value less thanm
, sop % m == p
. -
p_pow
andhash_value
have already been reduced modulom
when they were computed, reducing them modulom
again would do nothing. -
char_value
is at most 26, which is already less thanm
. -
char_value * p_pow
is at most26 * (m - 1)
. That can be, and often will be, more thanm
. So reducing it modulom
would do something. But it can still be delayed, because the next step is still "safe" (no overflow) -
char_value * p_pow + hash_value
is still at most27 * (m - 1)
which is still much less than 263-1 (the maximum value for along long
, see below why I assume that along long
is 64-bit), so there is no problem yet. It's fine to reduce modulom
after the addition.
As a bonus, the loop could actually do (263-1) / (27 * (m - 1)) iterations before it needs to reduce hash_value
modulo m
. That's over 341 million iterations! For most practical purposes you could therefore remove the first % m
and return hash_value % m;
instead.
I used 263-1 in this calculation because p_pow = (p_pow * p) % m
requires long long
to be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it was a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, because p_pow
can be approximately a billion and a 32-bit type cannot hold 31 billion.
BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for char_value
which is bad news because the remainder operator %
in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.