What is the best way to find common elements from 2 sets?

Recently I had an interview and I was asked one question.

I have 2 sets with around 1 Million records each. I have to find the common element in 2 sets.

My response:

I will create a new empty Set. And i gave him below solution but he was not happy with it. He said there are 1 million records so the solution won't be good.

public Set<Integer> commonElements(Set<Integer> s1, Set<Integer> s2) {
    Set<Integer> res = new HashSet<>();
     for (Integer temp : s1) {
        if(s2.contains(temp)) {
            res.add(temp);
        }
     }
     return res;
}

What is the better way to solve this problem then?


First of all: in order determine the intersection of two sets, you absolutely have to look at all entries of at least one of the two sets (to figure whether it is in the other set). There is no magic around that would tell you that in less than O(min(size(s1), size(s2)). Period.

The next thing to tell the interviewer: "1 million entries. You must be kidding. It is 2019. Any decent piece of hardware crunches two 1-million sets in less than a second". (Of course: that only applies for objects that are cheap to compare, like here for Integer instances. If oneRecord.equals(anotherRecord) is a super expensive operation, then 1 million entries could still be a problem in 2022).

Then you briefly mention that there are various built-in ways to solve this, as well as various 3rd party libraries. But you avoid the mistake that the other two answers make: pointing to a library that does compute the intersect is not at all something you sell as "solution" to this question.

You see, regarding coding: the java Set interface has an easy solution to that: s1.retainAll(s2) computes the join of the two sets, as it removes all elements from s1 that aren't in s2.

Obviously, you have to mention within the interview that this will modify s1.

In case that the requirement is to not modify s1 or s2, your solution is a viable way to go, and there isn't anything one can do about the runtime cost. If it all, you could call size() for both sets and iterate the one that has less entries.

Alternatively, you can do

Set<String> result = new HashSet<>(s1);
return result.retain(s2);

but in the end, you have to iterate one set and for each element determine whether it is in the second set.

But of course, the real answer to such questions is always always always to show the interviewer that you are able to dissect the problem into its different aspects. You outline basic constraints, you outline different solutions and discuss their pros and cons. Me for example, I would expect you to sit down and maybe write a program like this:

public class Numbers {    
    private final static int numberOfEntries = 20_000_000;
    private final static int maxRandom = numberOfEntries;

    private Set<Integer> s1;
    private Set<Integer> s2;

    @Before
    public void setUp() throws Exception {
        Random random = new Random(42);
        s1 = fillWithRandomEntries(random, numberOfEntries);
        s2 = fillWithRandomEntries(random, numberOfEntries);
    }

    private static Set<Integer> fillWithRandomEntries(Random random, int entries) {
        Set<Integer> rv = new HashSet<>();
        for (int i = 0; i < entries; i++) {
            rv.add(random.nextInt(maxRandom));
        }
        return rv;
    }

    @Test
    public void classic() {
        long start = System.currentTimeMillis();
        HashSet<Integer> intersection = new HashSet<>();
          s1.forEach((i) -> {
           if (s2.contains(i))
             intersection.add(i);
        });
        long end = System.currentTimeMillis();
        System.out.println("foreach duration: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }


    @Test
    public void retainAll() {
        long start = System.currentTimeMillis();
        s1.retainAll(s2);
        long end = System.currentTimeMillis();
        System.out.println("Retain all duration: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + s1.size());
    }

    @Test
    public void streams() {
        long start = System.currentTimeMillis();
        Set<Integer> intersection = s1.stream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
        long end = System.currentTimeMillis();
        System.out.println("streaming: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }

    @Test
    public void parallelStreams() {
        long start = System.currentTimeMillis();
        Set<Integer> intersection = s1.parallelStream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
        long end = System.currentTimeMillis();
        System.out.println("parallel streaming: " + (end-start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }
}

The first observation here: I decided to run with 20 million entries. I started with 2 million, but all three tests would run well below 500 ms. Here is the print out for 20 million on my Mac Book Pro:

foreach duration: 9304 ms
intersection.size() = 7990888 
streaming: 9356 ms
intersection.size() = 7990888
Retain all duration: 685 ms
intersection.size() = 7990888
parallel streaming: 6998 ms
intersection.size() = 7990888

As expected: all intersects have the same size (because I seeded the random number generator to get to comparable results).

And surprise: modifying s1 in place ... is by far the cheapest option. It beats streaming by a factor of 10. Also note: the parallel streaming is quicker here. When running with 1 million entries, the sequential stream was faster.

Therefore I initially mentioned to mention "1 million entries is not a performance problem". That is a very important statement, as it tells the interviewer that you are not one of those people wasting hours to micro-optimize non-existing performance issues.


you can use

CollectionUtils

its from apache

CollectionUtils.intersection(Collection a,Collection b)