Efficient way to remove half of the duplicate items in a list

If order isn't important, a way would be to get the odd or even indexes only after a sort. Those lists will be the same so you only need one of them.

l = [1,8,8,8,1,3,3,8]
l.sort()

# Get all odd indexes
odd = l[1::2]

# Get all even indexes
even = l[::2]

print(odd)
print(odd == even)

Result:

[1, 3, 8, 8]
True

Use a counter to keep track of the count of each element

from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
    res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]

Since you guarantee that each element of the list occurs a multiple of 2, then it is faster to build the counter as you build the output list, rather than building a counter (or sort) first and using it later.

l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1
  if count[i]%2: res.append(i)

print(res)

Output

[1,8,8,3]

EDIT Comparing time/expense of each method

Using the timeit module shows that this approach is 2.7 times faster than using a counter first.

i.e.

def one():
  l = [1,8,8,8,1,3,3,8]
  count={}
  res=[]
  for i in l:
    if i in count: count[i]+=1
    else: count[i]=1
    if count[i]%2: res.append(i)

  #print(res)


def two():
  from collections import Counter
  l = [1,8,8,8,1,3,3,8]
  res = []
  count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
  for key, val in count.items():
    res.extend(val//2 * [key])

o=timeit.Timer(one)

t=timeit.Timer(two)

print(o.timeit(100000))

print(t.timeit(100000))

print(o.timeit(100000))

print(t.timeit(100000))

Output (seconds)

If order isn't important, then Wimanicesir's method would be preferred with 4x greater speedup, with result of 0.07037 (~11 times faster than with counter approach).

UPDATE I suspected that using the Counter method in two (unordered) may come with significant bloat or slow down in import, so I tested the "count first, compile result later" method while counting with the simple method here from one (ordered)

count={}
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1

which was much faster than Counter. Replacing Counter in two of the tests defined resulted in a time of 0.31 instead of 0.80. Still slightly faster to compile (ordered) result during counting as in two, however. And much faster for unordered result to use Wimanicesir's method.

Efficient way to remove half of the duplicate items in a list

Related

Recent Posts