How does array_diff work?
Solution 1:
user187291's suggestion to do it in PHP via hash tables is simply great! In a rush of adrenaline taken from this phantastic idea, I even found a way to speed it up a little more (PHP 5.3.1):
function leo_array_diff($a, $b) {
$map = array();
foreach($a as $val) $map[$val] = 1;
foreach($b as $val) unset($map[$val]);
return array_keys($map);
}
With the benchmark taken from user187291's posting:
LEO=0.0322 leo_array_diff()
ME =0.1308 my_array_diff()
YOU=4.5051 your_array_diff()
PHP=45.7114 array_diff()
The array_diff() performance lag is evident even at 100 entries per array.
Note: This solution implies that the elements in the first array are unique (or they will become unique). This is typical for a hash solution.
Note: The solution does not preserve indices. Assign the original index to $map and finally use array_flip() to preserve keys.
function array_diff_pk($a, $b) {
$map = array_flip($a);
foreach($b as $val) unset($map[$val]);
return array_flip($map);
}
PS: I found this while looking for some array_diff() paradoxon: array_diff() took three times longer for practically the same task if used twice in the script.
Solution 2:
UPDATE
see below for faster/better code.
array_diff behaviour is much better in php 5.3.4, but still ~10 times slower than Leo's function.
also it's worth noting that these functions are not strictly equivalent to
array_diff
since they don't maintain array keys, i.e.my_array_diff(x,y) == array_values(array_diff(x,y))
.
/UPDATE
A better solution is to use hash maps
function my_array_diff($a, $b) {
$map = $out = array();
foreach($a as $val) $map[$val] = 1;
foreach($b as $val) if(isset($map[$val])) $map[$val] = 0;
foreach($map as $val => $ok) if($ok) $out[] = $val;
return $out;
}
$a = array('A', 'B', 'C', 'D');
$b = array('X', 'C', 'A', 'Y');
print_r(my_array_diff($a, $b)); // B, D
benchmark
function your_array_diff($arraya, $arrayb)
{
foreach ($arraya as $keya => $valuea)
{
if (in_array($valuea, $arrayb))
{
unset($arraya[$keya]);
}
}
return $arraya;
}
$a = range(1, 10000);
$b = range(5000, 15000);
shuffle($a);
shuffle($b);
$ts = microtime(true);
my_array_diff($a, $b);
printf("ME =%.4f\n", microtime(true) - $ts);
$ts = microtime(true);
your_array_diff($a, $b);
printf("YOU=%.4f\n", microtime(true) - $ts);
result
ME =0.0137
YOU=3.6282
any questions? ;)
and, just for fun,
$ts = microtime(true);
array_diff($a, $b);
printf("PHP=%.4f\n", microtime(true) - $ts);
result
ME =0.0140
YOU=3.6706
PHP=19.5980
that's incredible!
Solution 3:
The best solution to know how it works it to take a look at its source-code ;-)
(Well, that's one of the powers of open source -- and if you see some possible optimization, you can submit a patch ;-) )
For array_diff, it should be in ext/standard
-- which means, for PHP 5.3, it should be there : branches/PHP_5_3/ext/standard
And, then, the array.c
file looks like a plausible target ; the php_array_diff
function, line 3381, seems to correspond to array_diff
.
(Good luck going through the code : it's quite long...)