How to write a counter function for Vec<String> in Rust? [duplicate]

Function name: my_counter
Input: ['foo', 'bar', 'bar', 'bar', 'bar']
Output: {'foo': 1, 'bar': 4}

Note: The Output type is HashMap<String, usize>, but not HashMap<&str, usize>.

Here is my implementation, which I believe has some unnecessary overhead: 'bar' is converted to a new String four times, which might not be needed.

use std::collections::HashMap;

pub fn my_counter(vec: &Vec<String>) -> HashMap<String, usize> {
    let mut result: HashMap<String, usize> = HashMap::new();
    for key in vec.iter() {
        let val = result.entry(key.to_string()).or_insert(0);
        *val += 1;
    }
    result
}

Would anybody like to share a better solution? Thanks very much~


Solution 1:

One thing you could do is avoid creating new Strings by consuming (or moving) the original vector's values. Like this:

use std::collections::HashMap;

pub fn my_counter(vec: Vec<String>) -> HashMap<String, usize> {
    let mut result: HashMap<String, usize> = HashMap::new();
    for key in vec {
        let val = result.entry(key).or_insert(0);
        *val += 1;
    }
    result
}

Note the changed function signature: vec is now Vec<String>, not &Vec<String>. If that, for some reason, is not acceptable, then you can't entirely avoid creating new String values for the HashMap's keys. Let's say N is the number of vector items and M is the number of unique items. If you know you are going to have lots of duplicate values in the vector (that is, M is much smaller than N), in theory you might be able to get away with creating only M new Strings. However, stable Rust doesn't seem to provide a way to achieve that directly with a single lookup per item. If using Nightly is OK for you, maybe try raw_entry_mut().
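For what it's worth, here is a sketch of what you can do on stable if you accept a second lookup for each new key: because String implements Borrow<str>, you can probe the map with a &str and only clone the key on its first occurrence, so you end up with M clones instead of N. (The function name is mine, not from the question.)

```rust
use std::collections::HashMap;

// Clones each key only on its first occurrence (M clones total),
// at the cost of a second hash lookup whenever the key is missing.
pub fn my_counter_get_mut(vec: &[String]) -> HashMap<String, usize> {
    let mut result: HashMap<String, usize> = HashMap::new();
    for key in vec {
        // `get_mut` accepts a `&str` because `String: Borrow<str>`,
        // so no allocation happens on this lookup.
        if let Some(count) = result.get_mut(key.as_str()) {
            *count += 1;
        } else {
            result.insert(key.clone(), 1);
        }
    }
    result
}

fn main() {
    let v: Vec<String> = ["foo", "bar", "bar", "bar", "bar"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", my_counter_get_mut(&v));
}
```

Whether the extra lookup on misses pays off depends on the M:N ratio discussed above.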

An alternative approach might be to first create a temporary HashMap<&str, usize> and then convert it to the desired "fully owned" HashMap<String, usize>. However, it's likely that would only make things worse. It really depends on the keys you have: it is one thing to have short keys and an M:N ratio of about 0.5-1.0, and something completely different to have long keys and a ratio of, say, 0.0001.
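In case it isn't obvious how that conversion looks, here is a sketch (again, the name is mine): count with borrowed keys first, then rebuild the map with owned keys, which allocates exactly M Strings but hashes each unique key one extra time.

```rust
use std::collections::HashMap;

// First pass: count with borrowed `&str` keys, no allocations.
// Second pass: rebuild the map with owned `String` keys (M allocations),
// hashing each unique key once more during the conversion.
pub fn my_counter_twice(vec: &[String]) -> HashMap<String, usize> {
    let mut borrowed: HashMap<&str, usize> = HashMap::new();
    for key in vec {
        *borrowed.entry(key.as_str()).or_insert(0) += 1;
    }
    borrowed
        .into_iter()
        .map(|(key, count)| (key.to_string(), count))
        .collect()
}

fn main() {
    let v: Vec<String> = vec!["foo".to_string(), "bar".to_string(), "bar".to_string()];
    println!("{:?}", my_counter_twice(&v));
}
```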

If you have a good idea of the number of unique keys, you can definitely speed things up to some degree by simply creating the HashMap with HashMap::with_capacity(...). Using an alternative to the default hasher could also help in theory, although I have only tried FnvHashMap, and even for short String keys I couldn't get any significant speed-up.
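To illustrate, a sketch with a hypothetical unique_hint parameter (my invention, purely for demonstration) standing in for that estimate; pre-allocating avoids rehashing as the map grows:

```rust
use std::collections::HashMap;

// `unique_hint` is an assumed estimate of the number of distinct keys.
// Pre-allocating capacity up front avoids rehashing while the map fills.
pub fn my_counter_with_hint(vec: &[String], unique_hint: usize) -> HashMap<String, usize> {
    let mut result: HashMap<String, usize> = HashMap::with_capacity(unique_hint);
    for key in vec {
        *result.entry(key.to_string()).or_insert(0) += 1;
    }
    result
}

fn main() {
    let v: Vec<String> = vec!["foo".to_string(), "bar".to_string(), "bar".to_string()];
    println!("{:?}", my_counter_with_hint(&v, 2));
}
```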

Solution 2:

I can immediately think of three other variants of doing this:

  • See first whether the entry already exists with contains_key, then either insert or update with get_mut
  • Blindly just insert, then re-insert if the first insert returned a previous value
  • Build a HashMap<&str, usize> first, then convert that (doing the work of inserting twice).
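As a concrete example of the second variant, here is a sketch of the "blind" approach (the name matches the benchmark labels below; the function name is mine): insert a count of 1 unconditionally, and if insert returns a previous count, re-insert the incremented value.

```rust
use std::collections::HashMap;

// "Blind" variant: insert 1 unconditionally. `insert` returns the previous
// value if the key already existed, in which case we re-insert the
// incremented count. Every occurrence clones the key, and duplicates
// pay for a second insert on top of that.
pub fn my_counter_blind(vec: &[String]) -> HashMap<String, usize> {
    let mut result: HashMap<String, usize> = HashMap::new();
    for key in vec {
        if let Some(prev) = result.insert(key.clone(), 1) {
            result.insert(key.clone(), prev + 1);
        }
    }
    result
}

fn main() {
    let v: Vec<String> = vec!["foo".to_string(), "bar".to_string(), "bar".to_string()];
    println!("{:?}", my_counter_blind(&v));
}
```

The double cost on duplicates is exactly why this variant suffers on inputs with few unique keys.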

The problem is that which variant is fastest depends on the input data: how many unique strings there are, how long they are, and how long the input Vec is.

I've benchmarked 5 cases:

  • d1: Your short example input data
  • few10k: 10 unique strings of length 4 in an array with 10000 entries
  • many10k: 10000 entries of length ≤8, nearly all unique
  • long: 300 unique strings of length 300
  • mix: All of the above

Here are the results:

test d1_blind         ... bench:         331 ns/iter (+/- 5)
test d1_seeing        ... bench:         234 ns/iter (+/- 2)
test d1_tushushu      ... bench:         230 ns/iter (+/- 4)
test d1_twice         ... bench:         232 ns/iter (+/- 4)
test few10k_blind     ... bench:     727,628 ns/iter (+/- 4,557)
test few10k_seeing    ... bench:     302,445 ns/iter (+/- 3,282)
test few10k_tushushu  ... bench:     399,378 ns/iter (+/- 3,870)
test few10k_twice     ... bench:     173,828 ns/iter (+/- 2,431)
test long_blind       ... bench:      70,604 ns/iter (+/- 233)
test long_seeing      ... bench:      93,349 ns/iter (+/- 381)
test long_tushushu    ... bench:      69,862 ns/iter (+/- 231)
test long_twice       ... bench:      92,118 ns/iter (+/- 292)
test many10k_blind    ... bench:   1,060,656 ns/iter (+/- 65,839)
test many10k_seeing   ... bench:   1,148,671 ns/iter (+/- 58,452)
test many10k_tushushu ... bench:     957,733 ns/iter (+/- 5,392)
test many10k_twice    ... bench:   1,297,526 ns/iter (+/- 57,415)
test mix_blind        ... bench:   1,962,634 ns/iter (+/- 7,973)
test mix_seeing       ... bench:   1,617,921 ns/iter (+/- 45,453)
test mix_tushushu     ... bench:   1,519,876 ns/iter (+/- 4,872)
test mix_twice        ... bench:   1,597,130 ns/iter (+/- 11,068)

It seems your implementation performs best in most cases. You could go further and start mixing implementations, e.g. run the twice variant until its HashMap<&str, usize> reaches a certain size, then switch to your implementation. Since this problem appears in many places (e.g. in SQL databases, and those people love optimizing things to death), somebody has probably written a paper about the best method for it somewhere.
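A rough, untested sketch of that mixing idea, with a hypothetical threshold parameter (both the name and the cut-over strategy are my own illustration, not something I benchmarked):

```rust
use std::collections::HashMap;

// Hybrid sketch: count with borrowed keys while the number of unique keys
// stays below `threshold`; once it grows past that, convert to an owned-key
// map and finish counting with the plain entry-based approach.
pub fn my_counter_hybrid(vec: &[String], threshold: usize) -> HashMap<String, usize> {
    let mut borrowed: HashMap<&str, usize> = HashMap::new();
    let mut iter = vec.iter();
    for key in iter.by_ref() {
        *borrowed.entry(key.as_str()).or_insert(0) += 1;
        if borrowed.len() > threshold {
            break;
        }
    }
    // Convert what we have so far to owned keys...
    let mut result: HashMap<String, usize> = borrowed
        .into_iter()
        .map(|(key, count)| (key.to_string(), count))
        .collect();
    // ...and count the rest of the input directly into the owned map.
    for key in iter {
        *result.entry(key.to_string()).or_insert(0) += 1;
    }
    result
}

fn main() {
    let v: Vec<String> = ["a", "b", "c", "a"].iter().map(|s| s.to_string()).collect();
    println!("{:?}", my_counter_hybrid(&v, 1));
}
```

Whether this beats either pure strategy would again depend on the M:N ratio and key lengths of the actual input.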