Faster algorithm to find unique element between two arrays?
EDIT: For anyone new to this question, I have posted an answer clarifying what was going on. The accepted answer is the one I feel best answers my question as originally posted, but for further details please see my answer.
NOTE: This problem was originally pseudocode and used lists. I have adapted it to Java and arrays. So while I'd love to see any solutions that use Java-specific tricks (or tricks in any language for that matter!), just remember that the original problem is language-independent.
The Problem
Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element. As an example:
int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};
Design an algorithm that takes as input these two arrays and outputs the single unique integer (in the above case, 7).
The Solution (So Far)
I came up with this:
public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret ^= a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret ^= b[i];
    }
    return ret;
}
The "official" solution presented in class:
public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret += a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret -= b[i];
    }
    return Math.abs(ret);
}
So, both are conceptually doing the same thing. And given that a is of length m and b is of length n, both solutions have a running time of O(m + n).
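For anyone who wants to sanity-check both versions on the example arrays, a quick harness like the one below works (not part of the original snippets; the methods are renamed xorUnique and sumUnique here only so the two versions can live in one class):

public class UniqueElementDemo {
    // Copy of the XOR solution above, renamed so both snippets can coexist.
    static int xorUnique(int[] a, int[] b) {
        int ret = 0;
        for (int x : a) ret ^= x;
        for (int x : b) ret ^= x;
        return ret;
    }

    // Copy of the sum/subtract solution above, renamed for the same reason.
    static int sumUnique(int[] a, int[] b) {
        int ret = 0;
        for (int x : a) ret += x;
        for (int x : b) ret -= x;
        return Math.abs(ret);
    }

    public static void main(String[] args) {
        int[] a = {6, 5, 6, 3, 4, 2};
        int[] b = {5, 7, 6, 6, 2, 3, 4};
        System.out.println(xorUnique(a, b)); // prints 7
        System.out.println(sumUnique(a, b)); // prints 7
    }
}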
The Question
I later got to talking with my teacher and he hinted that there was an even faster way of doing it. Honestly I don't see how; to find out whether an element is unique it seems you'd have to look at every element at least once. And that's at least O(m + n)...right?
So is there a faster way? And if so, what is it?
This is probably the fastest you can do it in Java using HotLick's suggestion in the comments. It makes the assumption that b.length == a.length + 1, so b is the larger array with the extra "unique" element.
public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    int i;
    // Fold both arrays into ret in a single pass; every value that appears
    // in both arrays cancels itself out overall.
    for (i = 0; i < a.length; i++) {
        ret = ret ^ a[i] ^ b[i];
    }
    // i == a.length here, so b[i] is the one element the loop didn't reach.
    return ret ^ b[i];
}
Even if the assumption cannot be made, you can easily expand it to include the case where either a or b can be the larger array with the unique element. It's still O(m+n) though and only loop/assignment overhead is reduced.
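One way to expand it might look something like the sketch below (just a sketch, the exact handling is up to you): pick whichever array is longer, XOR the paired elements in one pass, and fold in the leftover element at the end.

public static int getUniqueElement(int[] a, int[] b) {
    // Whichever array is longer by one element holds the extra value.
    int[] shorter = (a.length < b.length) ? a : b;
    int[] longer  = (a.length < b.length) ? b : a;
    int ret = 0;
    int i;
    for (i = 0; i < shorter.length; i++) {
        ret = ret ^ shorter[i] ^ longer[i];
    }
    // i == shorter.length here, so longer[i] is the one element the loop missed.
    return ret ^ longer[i];
}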
Edit:
Due to details of language implementation, this is still (surprisingly) the fastest way to do it in CPython.
def getUniqueElement1(A, B):
    ret = 0
    for a in A: ret = ret ^ a
    for b in B: ret = ret ^ b
    return ret
I have tested this with the timeit module and found some interesting results. It turns out that the longhand ret = ret ^ a is indeed faster in Python than the shorthand ret ^= a. Also, iterating directly over the elements of a list is much, much faster than iterating over the indexes and then doing subscript operations in Python. That is why this code is much faster than my previous method where I tried to copy the Java.
I guess the moral of the story is that there is no correct answer because the question is bogus anyways. As the OP noted in another answer below, it turns out you can't really go any faster than O(m+n) on this and his teacher was just pulling his leg. Thus the problem reduces to finding the fastest way to iterate over all elements in the two arrays and accumulating the XOR of all of them. And this means it's entirely dependent on language implementation, and you have to do some testing and playing around to get the true "fastest" solution in whatever implementation you are using, because the overall algorithm will not change.
Alright here we go...apologies to anyone expecting a faster solution. It turns out my teacher was having a little fun with me and I completely missed the point of what he was saying.
I should begin by clarifying what I meant by:
he hinted that there was an even faster way of doing it
The gist of our conversation was this: he said that my XOR approach was interesting, and we talked for a while about how I arrived at my solution. He asked me whether I thought my solution was optimal. I said I did (for the reasons I mentioned in my question). Then he asked me, "Are you sure?" with a look on his face I can only describe as "smug". I was hesitant but said yeah. He asked me if I could think of a better way to do it. I was pretty much like, "You mean there's a faster way?" but instead of giving me a straight answer he told me to think about it. I said I would.
So I thought about it, sure that my teacher knew something I didn't. And after not coming up with anything for a day, I came here.
What my teacher actually wanted me to do was defend my solution as being optimal, not try to find a better solution. As he put it: creating a nice algorithm is the easy part, the hard part is proving it works (and that it's the best). He thought it was quite funny that I spent so much time in Find-A-Better-Way Land instead of working out a simple proof of O(n) that would have taken considerably less time (we ended up doing so, see below if you're interested).
So I guess, big lesson learned here. I'll be accepting Shashank Gupta's answer because I think that it does manage to answer the original question, even though the question was flawed.
I'll leave you guys with a neat little Python one-liner I found while typing the proof. It's not any more efficient but I like it:
def getUniqueElement(a, b):
    # (On Python 3 you'd need "from functools import reduce"; it was a builtin in Python 2.)
    return reduce(lambda x, y: x^y, a + b)
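(And if you'd rather stay in Java, a rough stream-based equivalent of that one-liner might look like the sketch below, assuming Java 8+ with java.util.Arrays and java.util.stream.IntStream imported. It's not any more efficient either, just compact.)

public static int getUniqueElement(int[] a, int[] b) {
    // Concatenate both arrays as one IntStream and fold XOR over every element.
    return IntStream.concat(Arrays.stream(a), Arrays.stream(b))
                    .reduce(0, (x, y) -> x ^ y);
}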
A Very Informal "Proof"
Let's start with the original two arrays from the question, a and b:
int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};
We'll say here that the shorter array has length n; the longer array must then have length n + 1. The first step to proving linear complexity is to append the arrays together into a third array (we'll call it c):
int[] c = {6, 5, 6, 3, 4, 2, 5, 7, 6, 6, 2, 3, 4};
which has length 2n + 1. Why do this? Well, now we have another problem entirely: finding the element that occurs an odd number of times in c (from here on "odd number of times" and "unique" are taken to mean the same thing). This is actually a pretty popular interview question and is apparently where my teacher got the idea for his problem, so now my question has some practical significance. Hooray!
Let's assume there is an algorithm faster than O(n), such as O(log n). What this means is that it will only access some of the elements of c. For example, an O(log n) algorithm might only have to check log(13) ~ 4 of the elements in our example array to determine the unique element. Our question is: is this possible?
First let's see if we can get away with removing any of the elements (by "removing" I mean not having to access it). How about if we remove 2 elements, so that our algorithm only checks a subarray of c with length 2n - 1? This is still linear complexity, but if we can do that then maybe we can improve upon it even further.
So, let's choose two elements of c completely at random to remove. There are actually several things that could happen here, which I'll summarize into cases:
// Case 1: Remove two identical elements
{6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};
// Case 2: Remove the unique element and one other element
{6, 6, 3, 4, 2, 5, 6, 6, 2, 3, 4};
// Case 3: Remove two different elements, neither of which are unique
{6, 5, 6, 4, 2, 5, 7, 6, 6, 3, 4};
What does our array now look like? In the first case, 7 is still the unique element. In the second case there is a new unique element, 5. And in the third case there are now 3 unique elements...yeah it's a total mess there.
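(If you want to double-check those odd-count claims yourself, a quick occurrence count does the trick; this little helper is only for illustration and not part of the argument.)

// Prints every value that occurs an odd number of times in the given array.
static void printOddCountElements(int[] arr) {
    java.util.Map<Integer, Integer> counts = new java.util.HashMap<>();
    for (int x : arr) counts.merge(x, 1, Integer::sum);
    for (java.util.Map.Entry<Integer, Integer> e : counts.entrySet()) {
        if (e.getValue() % 2 != 0) System.out.println(e.getKey());
    }
}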
Now our question becomes: can we determine the unique element of c just by looking at this subarray? In the first case we see that 7 is the unique element of the subarray, but we can't be sure it is also the unique element of c; the two removed elements could just as well have been 7 and 1. A similar argument applies for the second case. In case 3, with 3 unique elements we have no way of telling which two are non-unique in c.
It becomes clear that even with 2n - 1 accesses, there is just not enough information to solve the problem. And so the optimal solution is a linear one.
Of course, a real proof would use induction and not use proof-by-example, but I'll leave that to someone else :)