Selecting/excluding sets of columns in pandas [duplicate]
I would like to create views or dataframes from an existing dataframe based on column selections.
For example, I would like to create a dataframe df2
from a dataframe df1
that holds all columns from it except two of them. I tried doing the following, but it didn't work:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')
# This returns an error ("unhashable type: set")
df2 = df[my_cols]
What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?
Solution 1:
You can either Drop the columns you do not need OR Select the ones you need
# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)
# Select the ones you want
df1 = df[['a','d']]
Solution 2:
There is a new index method called difference
. It returns the original columns, with the columns passed as argument removed.
Here, the result is used to remove columns B
and D
from df
:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
Advantage over drop
: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
Solution 3:
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]
# or more simply include columns:
df[['A', 'B']]
# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
Solution 4:
You don't really need to convert that into a set:
cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]