What causes "indexing past lexsort depth" warning in Pandas?

Solution 1:

TL;DR: your index is unsorted and this severely impacts performance.

Sort your DataFrame's index using df.sort_index() to address the warning and improve performance.

I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.
  # encoding: utf-8

         col
one two     
c   u      9

You'll see the same behaviour with xs:

df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.
  self.interact()

         col
one two     
c   u      9

The docs, backed by this timing test I once did seem to suggest that handling un-sorted indexes imposes a slowdown—Indexing is O(N) time when it could/should be O(1).

If you sort the index before slicing, you'll notice the difference:

df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]

         col
one two     
c   u      9


%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]

802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.

df.index.is_lexsorted()
# False

df2.index.is_lexsorted()
# True

As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:

df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]

If your index is not unique, add a cumcounted level first,

df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True) 
df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
df2 = df2.reset_index(level=-1, drop=True)

Solution 2:

According to pandas advanced indexing (Sorting a Multiindex)

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex

And also:

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

According to them, you may need to ensure that indices are sorted properly.

Why use Glassfish instead of Apache? What's it strengths and weaknesses?

Question about a function definition (three dots in parameters..)

Split a list into parts based on a set of indexes in Python

Get distinct values of Queryset by field

Use SQL Server time datatype in C#.NET application?

How do static variables in lambda function objects work?

Using Javascript: How to create a 'Go Back' link that takes the user to a link if there's no history for the tab or window?

How to close a branch WITHOUT removing it from history in git?

UnexpectedRollbackException: Transaction rolled back because it has been marked as rollback-only

What does WEB-INF stand for in a Java EE web application? [closed]

How to "debug" Haskell with printfs?