Can I use a list as a hash in R? If so, why is it so slow?

Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.

For example, the following code will populate a hash with up to 10000 key/value pairs, where the keys are random letters and the values are random integers. Then, it does 10000 random lookups in that hash.

#!/usr/bin/perl -w
use strict;

my @letters = ('a'..'z');

print @letters . "\n";
my %testHash;

for(my $i = 0; $i < 10000; $i++) {
    my $r1 = int(rand(26));
    my $r2 = int(rand(26));
    my $r3 = int(rand(26));
    my $key = $letters[$r1] . $letters[$r2] . $letters[$r3];
    my $value = int(rand(1000));
    $testHash{$key} = $value;
}

my @keyArray = keys(%testHash);
my $keyLen = scalar @keyArray;

for(my $j = 0; $j < 10000; $j++) {
    my $key = $keyArray[int(rand($keyLen))];
    my $lookupValue = $testHash{$key};
    print "key " .  $key . " Lookup $lookupValue \n";
}

Now that increasingly, I am wanting to have a hash-like data structure in R. The following is the equivalent R code:

testHash <- list()

for(i in 1:10000) {
  key.tmp = paste(letters[floor(26*runif(3))], sep="")
  key <- capture.output(cat(key.tmp, sep=""))
  value <- floor(1000*runif(1))
  testHash[[key]] <- value
}

keyArray <- attributes(testHash)$names
keyLen = length(keyArray);

for(j in 1:10000) {
  key <- keyArray[floor(keyLen*runif(1))]
  lookupValue = testHash[[key]]
  print(paste("key", key, "Lookup", lookupValue))
}

The code seem to be doing equivalent things. However, the Perl one is much faster:

>time ./perlHashTest.pl
real    0m4.346s
user    **0m0.110s**
sys 0m0.100s

Comparing to R:

time R CMD BATCH RHashTest.R

real    0m8.210s
user    **0m7.630s**
sys 0m0.200s

What explains the discrepancy? Are lookups in R lists just not good?

Increasing to 100,000 list length and 100,000 lookups only exaggerates the discrepancy? Is there a better alternative for the hash data structure in R than the native list()?


Solution 1:

The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1), because during insert the key is converted to an integer using a hash function, and then the value put in the space hash(key) % num_spots of an array num_spots long (this is a big simplification and avoids the complexity of dealing with collisions). Lookups of the key just require hashing the key to find the value's position (which is O(1), versus a O(n) array lookup). R lists use name lookups which are O(n).

As Dirk says, use the hash package. A huge limitation with this is that it uses environments (which are hashed) and overriding of [ methods to mimic hash tables. But an environment cannot contain another environment, so you cannot have nested hashes with the hash function.

A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)

Solution 2:

You could try environments and/or the hash package by Christopher Brown (which happens to use environments under the hood).

Solution 3:

Your code is very un R-like and is one of the reasons it's so slow. I haven't optimized the code below for maximum speed, only R'ness.

n <- 10000

keys <- matrix( sample(letters, 3*n, replace = TRUE), nrow = 3 )
keys <- apply(keys, 2, paste0, collapse = '')
value <- floor(1000*runif(n))
testHash <- as.list(value)
names(testHash) <- keys

keys <- sample(names(testHash), n, replace = TRUE)
lookupValue = testHash[keys]
print(data.frame('key', keys, 'lookup', unlist(lookupValue)))

On my machine that runs almost instantaneously excluding the printing. Your code ran about the same speed you reported. Is it doing what you want? You could set n to 10 and just look at the output and testHash and see if that's it.

NOTE on syntax: The apply above is simply a loop and those are slow in R. The point of those apply family commands is expressiveness. Many of the commands that follow could have been put inside a loop with apply and if it was a for loop that would be the temptation. In R take as much out of your loop as possible. Using apply family commands makes this more natural because the command is designed to represent the application of one function to a list of some sort as opposed to a generic loop (yes, I know apply could be used on more than one command).