Interview question: remove duplicates from an unsorted linked list

If you give a person a fish, they eat for a day. If you teach a person to fish...

My measures for the quality of an implementation are:

  • Correctness: If you aren't getting the right answer in all cases, then it isn't ready
  • Readability/maintainability: Look at code repetition, understandable names, the number of lines of code per block/method (and the number of things each block does), and how difficult it is to trace the flow of your code. Look at any number of books focused on refactoring, programming best-practices, coding standards, etc, if you want more information on this.
  • Theoretical performance (worst-case and amortized): Big-O is a metric you can use. CPU and memory consumption should both be measured
  • Complexity: Estimate how long it would take an average professional programmer to implement (if they already know the algorithm). See if that is in line with how difficult the problem actually is

As for your implementation:

  • Correctness: I suggest writing unit tests to determine this for yourself and/or debugging it (on paper) from start to finish with interesting sample/edge cases. Null, one item, two items, various numbers of duplicates, etc
  • Readability/maintainability: It looks mostly fine, though your last two comments don't add anything. It is a bit more obvious what your code does than the code in the book
  • Performance: I believe both are N-squared. Whether the amortized cost is lower on one or the other I'll let you figure out :)
  • Time to implement: An average professional should be able to code this algorithm in their sleep, so looking good
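
The edge cases from the correctness bullet can be turned into a handful of checks. This is only a sketch: `removeDuplicates` below is a hypothetical stand-in (your implementation isn't reproduced here), and I'm assuming the list is a `java.util.LinkedList<Integer>` that is modified in place, with a null list treated as a no-op. The class and helper names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;

final class DedupChecks {
    // Hypothetical stand-in for the implementation under test: keeps the
    // first occurrence of each value and uses no extra buffer.
    static void removeDuplicates(LinkedList<Integer> list) {
        if (list == null) {
            return; // assumption: a null list is simply ignored
        }
        for (int i = 0; i < list.size(); i++) {
            Integer current = list.get(i);
            // Scan everything after position i and drop equal values.
            Iterator<Integer> rest = list.listIterator(i + 1);
            while (rest.hasNext()) {
                if (Objects.equals(rest.next(), current)) {
                    rest.remove();
                }
            }
        }
    }

    // Builds a list from the given values, deduplicates it, and returns it.
    static List<Integer> dedup(Integer... values) {
        LinkedList<Integer> list = new LinkedList<>(Arrays.asList(values));
        removeDuplicates(list);
        return list;
    }

    static void expect(List<Integer> actual, List<Integer> expected) {
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    // Runs the edge cases; throws AssertionError on the first failure.
    static void runChecks() {
        removeDuplicates(null);                                      // null list
        expect(dedup(), Arrays.asList());                            // empty list
        expect(dedup(7), Arrays.asList(7));                          // one item
        expect(dedup(1, 2), Arrays.asList(1, 2));                    // two distinct items
        expect(dedup(3, 3), Arrays.asList(3));                       // two equal items
        expect(dedup(5, 5, 5, 5), Arrays.asList(5));                 // all duplicates
        expect(dedup(1, 2, 1, 3, 2, 1), Arrays.asList(1, 2, 3));     // mixed duplicates
    }
}
```

Swap the body of `removeDuplicates` for the code you want to verify and call `DedupChecks.runChecks()`; it returns normally only if every case passes.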

There's not much of a difference. If I've done my math right, yours needs on average about N^2/16 more comparisons than the author's, but plenty of cases exist where your implementation will be faster.

Edit:

I'll call your implementation Y and the author's A

Both proposed solutions have O(N^2) as their worst case, and both have a best case of O(N) when all elements have the same value.

EDIT: This is a complete rewrite. Inspired by the debate in the comments, I tried to find the average case for N random numbers, that is, a sequence with a random size and a random distribution. What would the average case be?

Y will always run U iterations, where U is the number of unique numbers. Each iteration does N-X comparisons, where X is the number of elements removed prior to that iteration (+1). On the first iteration no element has been removed yet, and by the second iteration N/U will have been removed on average.

That is, on average ½N will be left to iterate over, so we can express the average cost as U*½N. The average U can be expressed in terms of N as well; the figure of (3*N^2)/16 reached further down only works out with U = N/2.

Expressing A is more difficult. Let's say we use I iterations before we've encountered all unique values. After that, each remaining element takes between 1 and U comparisons (on average U/2), and there are N-I such elements:

I*c + (U/2)*(N-I)

But what's the average number of comparisons (c) for each of the first I iterations? On average we need to compare against half of the elements already visited, and on average we've visited I/2 elements, i.e. c = I/4:

(I^2)/4 + (U/2)*(N-I)

I can be expressed in terms of N. On average we'll need to visit half of N to find all the unique values, so I = N/2. Substituting I = N/2, and taking U = N/2 as well, yields an average of

(N^2)/16 + (N^2)/8 = (3*N^2)/16.

That is, of course, if my estimates of the averages are correct. On average, over all potential sequences, A makes N^2/16 fewer comparisons than Y, but plenty of cases exist where Y is faster than A. So I'd say they are equal when compared by the number of comparisons.
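
If you'd rather measure than trust my arithmetic, the comparisons can simply be counted. The sketch below is my reading of the two strategies as described in this answer (neither original implementation is reproduced here): Y scans the rest of the list once per surviving element, and A compares each element against the already deduplicated prefix and stops at the first match. `Node`, `countY`, `countA` and `averages` are hypothetical names, and drawing values uniformly from [0, n) is an assumption about what "random" means.

```java
import java.util.Random;

final class CompareCounts {
    static final class Node {
        final int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    // Builds a singly linked list holding the values in the given order.
    static Node fromArray(int... values) {
        Node head = null;
        for (int i = values.length - 1; i >= 0; i--) {
            head = new Node(values[i], head);
        }
        return head;
    }

    // Strategy Y: for every surviving node, scan the rest of the list and
    // unlink equal nodes. Returns the number of value comparisons made.
    static long countY(Node head) {
        long comparisons = 0;
        for (Node current = head; current != null; current = current.next) {
            Node previous = current;
            while (previous.next != null) {
                comparisons++;
                if (previous.next.value == current.value) {
                    previous.next = previous.next.next; // unlink the duplicate
                } else {
                    previous = previous.next;
                }
            }
        }
        return comparisons;
    }

    // Strategy A: compare each node against the deduplicated prefix and stop
    // at the first match. Returns the number of value comparisons made.
    static long countA(Node head) {
        if (head == null) {
            return 0;
        }
        long comparisons = 0;
        Node previous = head;
        Node current = head.next;
        while (current != null) {
            boolean duplicate = false;
            for (Node runner = head; runner != current; runner = runner.next) {
                comparisons++;
                if (runner.value == current.value) {
                    duplicate = true;
                    break;
                }
            }
            if (duplicate) {
                previous.next = current.next; // unlink the current node
            } else {
                previous = current;
            }
            current = current.next;
        }
        return comparisons;
    }

    // Average comparisons {Y, A} over random sequences of length n (n > 0).
    // Assumption: values are drawn uniformly from [0, n).
    static double[] averages(int n, int trials, long seed) {
        Random random = new Random(seed);
        long totalY = 0;
        long totalA = 0;
        for (int t = 0; t < trials; t++) {
            int[] values = new int[n];
            for (int i = 0; i < n; i++) {
                values[i] = random.nextInt(n);
            }
            totalY += countY(fromArray(values));
            totalA += countA(fromArray(values));
        }
        return new double[] { (double) totalY / trials, (double) totalA / trials };
    }
}
```

For example, `CompareCounts.averages(1000, 200, 42L)` gives the two averages for 200 random sequences of length 1000, which can then be held against N^2/4 and (3*N^2)/16. A different distribution of values will shift both numbers, so treat the result as one data point rather than a proof.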


How about using a HashMap? This way it will take O(n) time and O(n) space. Here is a sketch.

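import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;

class Dedup {
  // Keeps the first occurrence of each value and removes later ones.
  static void removeDup(LinkedList<Integer> list) {
    HashMap<Integer, Boolean> map = new HashMap<>();
    // An Iterator removes in O(1) and never skips the element after a
    // removal, unlike index-based get(i)/remove(i) on a linked list.
    Iterator<Integer> it = list.iterator();
    while (it.hasNext()) {
      // put() returns the previous mapping: non-null means "seen before"
      if (map.put(it.next(), Boolean.TRUE) != null) {
        it.remove();
      }
    }
  }
}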

Of course we assume that HashMap has O(1) read and write.

Another solution is to mergesort the list and then remove duplicates in a single pass from start to end. This takes O(n log n).

Mergesort is O(n log n), and removing duplicates from a sorted list is O(n). Do you know why? Therefore the entire operation takes O(n log n).
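
The reason the second step is O(n): in a sorted list equal values sit next to each other, so one pass that compares each node with its successor is enough. Here is a minimal sketch on a singly linked list; `Node`, `SortedDedup` and the helper names are hypothetical, and note that the result comes out sorted, so the original order of the elements is lost.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedDedup {
    static final class Node {
        final int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    // Top-down merge sort on a singly linked list, O(n log n).
    static Node mergeSort(Node head) {
        if (head == null || head.next == null) {
            return head;
        }
        // Find the middle with slow/fast pointers and cut the list in two.
        Node slow = head;
        Node fast = head.next;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
        }
        Node second = slow.next;
        slow.next = null;
        return merge(mergeSort(head), mergeSort(second));
    }

    // Merges two sorted lists into one sorted list.
    static Node merge(Node a, Node b) {
        Node dummy = new Node(0, null);
        Node tail = dummy;
        while (a != null && b != null) {
            if (a.value <= b.value) {
                tail.next = a;
                a = a.next;
            } else {
                tail.next = b;
                b = b.next;
            }
            tail = tail.next;
        }
        tail.next = (a != null) ? a : b;
        return dummy.next;
    }

    // Single O(n) pass: in a sorted list duplicates are always adjacent.
    static Node removeAdjacentDuplicates(Node head) {
        Node current = head;
        while (current != null) {
            if (current.next != null && current.next.value == current.value) {
                current.next = current.next.next; // unlink the duplicate
            } else {
                current = current.next;
            }
        }
        return head;
    }

    static Node sortAndDedup(Node head) {
        return removeAdjacentDuplicates(mergeSort(head));
    }

    // Helpers for building and inspecting lists.
    static Node fromArray(int... values) {
        Node head = null;
        for (int i = values.length - 1; i >= 0; i--) {
            head = new Node(values[i], head);
        }
        return head;
    }

    static List<Integer> toList(Node head) {
        List<Integer> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) {
            out.add(n.value);
        }
        return out;
    }
}
```

Calling `SortedDedup.toList(SortedDedup.sortAndDedup(SortedDedup.fromArray(3, 1, 3, 2, 1)))` yields the values 1, 2, 3. Unlike the HashMap approach this needs no extra buffer beyond the recursion stack, at the price of reordering the list.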


Heapsort is an in-place sort. You could modify the "siftUp" or "siftDown" function to simply remove the element if it encounters a parent that is equal. This would be O(n log n)

function siftUp(a, start, end) is
    input:  start represents the limit of how far up the heap to sift.
            end is the node to sift up.
    child := end
    while child > start
        parent := floor((child - 1) ÷ 2)
        if a[parent] < a[child] then (out of max-heap order)
            swap(a[parent], a[child])
            child := parent (repeat to continue sifting up the parent now)
        else if a[parent] == a[child] then
            remove a[parent] (duplicate found; closing the gap and restoring heap order afterwards is not shown here)
        else
            return
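
For comparison, here is a simpler heap-based variant that is easy to get right. To be clear, this is not the modified siftUp described above: it relies on `java.util.PriorityQueue` (a binary heap) and on the fact that polling returns values in ascending order, so duplicates come out one right after the other and can be skipped. It is still O(n log n), but it is not in-place, since the heap holds a copy of the values. `HeapDedup` and `sortedUnique` are hypothetical names, and the input is assumed to contain no null elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class HeapDedup {
    // Returns the distinct values in ascending order.
    static List<Integer> sortedUnique(List<Integer> values) {
        // Building the heap copies the input; each poll() costs O(log n).
        PriorityQueue<Integer> heap = new PriorityQueue<>(values);
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            Integer smallest = heap.poll();
            // Equal values leave the heap consecutively, so comparing with
            // the last value emitted is enough to detect a duplicate.
            if (result.isEmpty() || !result.get(result.size() - 1).equals(smallest)) {
                result.add(smallest);
            }
        }
        return result;
    }
}
```

As with the mergesort approach, the output is sorted rather than in the original order.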