Efficient way to remove half of the duplicate items in a list
If order isn't important, a way would be to get the odd or even indexes only after a sort. Those lists will be the same so you only need one of them.
l = [1,8,8,8,1,3,3,8]
l.sort()
# Get all odd indexes
odd = l[1::2]
# Get all even indexes
even = l[::2]
print(odd)
print(odd == even)
Result:
[1, 3, 8, 8]
True
Use a counter to keep track of the count of each element
from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]
Since you guarantee that each element of the list occurs a multiple of 2, then it is faster to build the counter as you build the output list, rather than building a counter (or sort) first and using it later.
l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
if i in count: count[i]+=1
else: count[i]=1
if count[i]%2: res.append(i)
print(res)
Output
[1,8,8,3]
EDIT Comparing time/expense of each method
Using the timeit
module shows that this approach is 2.7 times faster than using a counter first.
i.e.
def one():
l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
if i in count: count[i]+=1
else: count[i]=1
if count[i]%2: res.append(i)
#print(res)
def two():
from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
res.extend(val//2 * [key])
o=timeit.Timer(one)
t=timeit.Timer(two)
print(o.timeit(100000))
print(t.timeit(100000))
print(o.timeit(100000))
print(t.timeit(100000))
Output (seconds)
0.28666
0.80822
0.28678
0.80113
If order isn't important, then Wimanicesir's method would be preferred with 4x greater speedup, with result of 0.07037 (~11 times faster than with counter approach).
UPDATE
I suspected that using the Counter
method in two
(unordered) may come with significant bloat or slow down in import, so I tested the "count first, compile result later" method while counting with the simple method here from one
(ordered)
count={}
for i in l:
if i in count: count[i]+=1
else: count[i]=1
which was much faster than Counter
. Replacing Counter
in two
of the tests defined resulted in a time of 0.31 instead of 0.80. Still slightly faster to compile (ordered) result during counting as in two
, however. And much faster for unordered result to use Wimanicesir's method.