sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
This might happen inside scikit-learn, depending on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one that requires, e.g., your matrix to be positive definite, while your input does not fulfill that criterion.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. The correct checks are:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any element of the array is NaN, not whether the return value of the any function is NaN.
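To make the difference concrete, here is a minimal sketch on a small made-up array (the values in `mat` are an assumption chosen for illustration). It runs both the wrong and the right form of each check side by side:

```python
import numpy as np

# Illustrative array containing one NaN and one infinity.
mat = np.array([[1.0, np.nan], [np.inf, 4.0]])

# Wrong: mat.any() / mat.all() collapse the array to a single boolean first,
# so isnan / isfinite only ever inspect that one True/False value.
wrong_nan_check = bool(np.isnan(mat.any()))
wrong_finite_check = bool(np.isfinite(mat.all()))

# Right: test every element, then reduce the element-wise result.
has_nan = bool(np.any(np.isnan(mat)))
all_finite = bool(np.all(np.isfinite(mat)))

print("wrong checks:", wrong_nan_check, wrong_finite_check)
print("right checks:", has_nan, all_finite)
```

The wrong checks report that the array is clean even though it contains both NaN and infinity; only the element-wise form catches them.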
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries from my df, such as
df = df[df.label=='desired_one']
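The answer does not say why resetting the index helps. One plausible explanation (an assumption on my part) is index alignment: a filtered frame keeps its original row labels, so combining it with freshly indexed data leaves gaps that pandas fills with NaN. A minimal sketch with made-up data; the column names and `drop=True` are my choices, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"label": ["a", "b", "a", "b"], "x": [1.0, 2.0, 3.0, 4.0]})

# Filtering keeps the original row labels (here 0 and 2).
subset = df[df.label == "a"]

# A new Series gets a fresh 0..n-1 index. pandas aligns on labels,
# so the row labelled 2 finds no partner and its value becomes NaN.
extra = pd.Series([10.0, 20.0])
misaligned = subset.assign(y=extra)

# Resetting the index first makes the labels line up again.
# drop=True discards the old index instead of adding it as a column.
aligned = subset.reset_index(drop=True).assign(y=extra)

print(misaligned)
print(aligned)
```

Note that a plain `df.reset_index()` keeps the old index as an extra column named `index`, which would then be passed to sklearn as a feature unless you drop it.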
This is my function (based on this) to clean the dataset of NaN, Inf, and missing cells (for skewed datasets):
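import pandas as pd
import numpy as np

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Drop rows with missing values (this modifies the caller's DataFrame in place).
    df.dropna(inplace=True)
    # Keep only rows in which no cell is NaN or +/- infinity.
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)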
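If you would rather not modify the caller's DataFrame, a non-mutating variant is possible. The sketch below is not the function above: `clean_dataset_copy` is a hypothetical name, and it assumes every column can be converted to float64.

```python
import numpy as np
import pandas as pd

def clean_dataset_copy(df):
    """Return a cleaned float64 copy of df; the input is left untouched."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df needs to be a pd.DataFrame")
    numeric = df.astype(np.float64)  # astype returns a new frame
    # Turn +/- infinity into NaN, then drop every row that contains a NaN.
    return numeric.replace([np.inf, -np.inf], np.nan).dropna()

# Illustrative data: row 1 has a NaN, row 2 has an infinity.
raw = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, np.inf, 4.0]})
cleaned = clean_dataset_copy(raw)
print(cleaned)
```

The returned frame keeps the original row labels of the surviving rows, so call `reset_index(drop=True)` on it if you need a contiguous index.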