Pandas: drop columns with all NaN's
I realize that dropping NaN
s from a dataframe is as easy as df.dropna
but for some reason that isn't working on mine and I'm not sure why.
Here is my original dataframe:
fish_frame1: 0 1 2 3 4 5 6 7
0 #0915-8 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN LIVE WGT NaN AMOUNT NaN TOTAL
2 GBW COD NaN NaN 2,280 NaN $0.60 NaN $1,368.00
3 POLLOCK NaN NaN 1,611 NaN $0.01 NaN $16.11
4 WHAKE NaN NaN 441 NaN $0.70 NaN $308.70
5 GBE HADDOCK NaN NaN 2,788 NaN $0.01 NaN $27.88
6 GBW HADDOCK NaN NaN 16,667 NaN $0.01 NaN $166.67
7 REDFISH NaN NaN 932 NaN $0.01 NaN $9.32
8 GB WINTER FLOUNDER NaN NaN 145 NaN $0.25 NaN $36.25
9 GOM WINTER FLOUNDER NaN NaN 25,070 NaN $0.35 NaN $8,774.50
10 GB YELLOWTAIL NaN NaN 26 NaN $1.75 NaN $45.50
The code that follows is an attempt to drop all NaN
s as well as any columns with more than 3 NaN
s (either one, or both, should work I think):
fish_frame.dropna()
fish_frame.dropna(thresh=len(fish_frame) - 3, axis=1)
This produces:
fish_frame1 after dropna: 0 1 2 3 4 5 6 7
0 #0915-8 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN LIVE WGT NaN AMOUNT NaN TOTAL
2 GBW COD NaN NaN 2,280 NaN $0.60 NaN $1,368.00
3 POLLOCK NaN NaN 1,611 NaN $0.01 NaN $16.11
4 WHAKE NaN NaN 441 NaN $0.70 NaN $308.70
5 GBE HADDOCK NaN NaN 2,788 NaN $0.01 NaN $27.88
6 GBW HADDOCK NaN NaN 16,667 NaN $0.01 NaN $166.67
7 REDFISH NaN NaN 932 NaN $0.01 NaN $9.32
8 GB WINTER FLOUNDER NaN NaN 145 NaN $0.25 NaN $36.25
9 GOM WINTER FLOUNDER NaN NaN 25,070 NaN $0.35 NaN $8,774.50
10 GB YELLOWTAIL NaN NaN 26 NaN $1.75 NaN $45.50
I'm a novice with Pandas so I'm not sure if this isn't working because I'm doing something wrong or I'm misunderstanding something or misusing a function. Any help is appreciated thanks.
Solution 1:
From the dropna
docstring:
Drop the columns where all elements are NaN:
df.dropna(axis=1, how='all')
A B D
0 NaN 2.0 0
1 3.0 4.0 1
2 NaN NaN 5
Solution 2:
dropna()
drops the null values and returns a dataFrame. Assign it back to the original dataFrame.
fish_frame = fish_frame.dropna(axis = 1, how = 'all')
Referring to your code:
fish_frame.dropna(thresh=len(fish_frame) - 3, axis=1)
This would drop columns with 7 or more NaN's (assuming len(df) = 10), if you want to drop columns with more than 3 Nan's like you've mentioned, thresh should be equal to 3.
Solution 3:
dropna()
by default returns a dataframe (defaults to inplace=False
behavior) and thus needs to be assigned to a new dataframe for it to stay in your code.
So for example,
fish_frame = fish_frame.dropna()
As to why your dropna
is returning an empty dataframe, I'd recommend you look at the "how" argument in the dropna method (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html). Also bear in mind, axis=0 corresponds to columns, and axis=1 corresponds to rows.
So to remove columns with all "NAs", axis=0, how="any" should do the trick:
fish_frame = fish_frame.dropna(axis=0, how="any")
Finally, the "thresh" argument designates explicitly how many NA's are necessary for a drop to occur. So
fish_frame = fish_frame.dropna(axis=0, thresh=3, how="any")
should work fine and dandy to remove any column with three NA's.
Also, as Corley pointed out, how="any" is the default and is thus not necessary.
Solution 4:
Another solution would be to create a boolean dataframe with True values at not-null positions and then take the columns having at least one True value. Below line removes columns with all NaN values.
df = df.loc[:,df.notna().any(axis=0)]
If you want to remove columns having at least one missing (NaN) value;
df = df.loc[:,df.notna().all(axis=0)]
This approach is particularly useful in removing columns containing empty strings, zeros or basically any given value. For example;
df = df.loc[:,(df!='').all(axis=0)]
removes columns having at least one empty string.