Indexing a list with an unique index

I have a list say l = [10,10,20,15,10,20]. I want to assign each unique value a certain "index" to get [1,1,2,3,1,2].

This is my code:

a = list(set(l))
res = [a.index(x) for x in l]

Which turns out to be very slow.

l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?


You can do this in O(N) time using a defaultdict and a list comprehension:

>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]

In Python 3 use __next__ instead of next.


If you're wondering how it works?

The default_factory(i.e count(1).next in this case) passed to defaultdict is called only when Python encounters a missing key, so for 10 the value is going to be 1, then for the next ten it is not a missing key anymore hence the previously calculated 1 is used, now 20 is again a missing key and Python will call the default_factory again to get its value and so on.

d at the end will look like this:

>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
            {10: 1, 20: 2, 15: 3})

The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.

The fastest way to transform one value to another is looking it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.

Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.

res = []
conversion = {}
i = 0
for x in l:
    if x not in conversion:
        value = conversion[x] = i
        i += 1
    else:
        value = conversion[x]
    res.append(value)

Well I guess it depends on if you want it to return the indexes in that specific order or not. If you want the example to return:

    [1,1,2,3,1,2]

then you can look at the other answers submitted. However if you only care about getting a unique index for each unique number then I have a fast solution for you

    import numpy as np
    l = [10,10,20,15,10,20]
    a = np.array(l)
    x,y = np.unique(a,return_inverse = True)

and for this example the output of y is:

    y = [0,0,2,1,0,2]

I tested this for 1,000,000 entries and it was done essentially immediately.


Your solution is slow because its complexity is O(nm) with m being the number of unique elements in l: a.index() is O(m) and you call it for every element in l.

To make it O(n), get rid of index() and store indexes in a dictionary:

>>> idx, indexes = 1, {}
>>> for x in l:
...     if x not in indexes:
...         indexes[x] = idx
...         idx += 1
... 
>>> [indexes[x] for x in l]
[1, 1, 2, 3, 1, 2]

If l contains only integers in a known range, you could also store indexes in a list instead of a dictionary for faster lookups.