Drop pandas columns based on lambda during method chaining
I want to drop columns of a pandas DataFrame using a lambda. Questions such as "How can I use lambda to drop column of pandas dataframe?" discuss this, but I want to do it inside a method chain (which is not a requirement in the other question). How can I do this?
Other questions, e.g. "Method chaining solution to drop column level in pandas DataFrame", discuss dropping column levels, which is a different problem.
Solution 1:
Let's assume you want to decide by its name whether a column gets dropped. Here are two options:
from time import time

import numpy as np
import pandas as pd

d = pd.DataFrame(np.random.randint(0, 10, (10000, 7)), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(d.shape)

# Option 1: rebuild every row without column 'a' (row-wise, slow)
t0 = time()
d.apply(lambda row: pd.Series([row[col_name] for col_name in row.index if col_name != 'a'],
                              index=[col_name for col_name in row.index if col_name != 'a']),
        axis=1, result_type="expand")
print(time() - t0)

# Option 2: overwrite column 'a' with NaN, then drop all-NaN columns (column-wise, fast)
t0 = time()
d.apply(lambda column: len(column)*[np.nan] if column.name == 'a' else column, axis=0).dropna(how='all', axis=1)
print(time() - t0)
Output (shape, then seconds for the first and the second option):
(10000, 7)
5.570859670639038
0.005705833435058594
Since the lambda receives the whole column, you can replace the name check with any condition you like, including one based on the column's values.
The first solution is somewhat less hacky, but it iterates over every row, which makes it very slow (about 5.6 s versus about 0.006 s in the timings above). The second version is very fast, albeit a bit hacky: you need to be sure that no other column consists only of np.nan values, because such a column would be dropped as well.
Maybe someone else has a solution that is fast and still not a hack.
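One possible candidate for a fast variant without the NaN trick, offered as a sketch rather than a definitive answer: both .loc and .pipe accept a callable that receives the intermediate DataFrame, so the columns can be filtered without leaving the chain. The predicate should_drop, the extra column h, and the small example frame are made up for illustration; only the column name 'a' comes from the example above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same column names as above.
df = pd.DataFrame(np.arange(21).reshape(3, 7), columns=list("abcdefg"))

# Hypothetical predicate: returns True for column names that should be dropped.
should_drop = lambda name: name == "a"

# Variant A: .loc with a callable that builds a boolean mask over the columns.
via_loc = (
    df
    .assign(h=lambda d: d["b"] * 2)  # some earlier step in the chain
    .loc[:, lambda d: [not should_drop(c) for c in d.columns]]
)

# Variant B: .pipe passes the intermediate frame to a lambda that calls drop().
via_pipe = (
    df
    .assign(h=lambda d: d["b"] * 2)
    .pipe(lambda d: d.drop(columns=[c for c in d.columns if should_drop(c)]))
)

print(list(via_loc.columns))
print(via_loc.equals(via_pipe))
```

Neither variant touches the values, so a column that happens to contain only NaN is kept unless the predicate selects it. A content-based condition should work the same way, for example by having the callable passed to .loc return a boolean Series computed from the frame, indexed by column name.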