Pandas: convert dtype 'object' to int
I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I can convert the date 'object' column to a Pandas datetime dtype, but I get an error when trying to convert the strings and integers.
Here is an example:
>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
id date purchase
1 abc1 2016-05-22 1
2 abc2 2016-05-29 0
3 abc3 2016-05-22 2
4 abc4 2016-05-22 0
>>> df.dtypes
id object
date object
purchase object
dtype: object
Converting df['date'] to a datetime works:
>>> pd.to_datetime(df['date'])
1 2016-05-22
2 2016-05-29
3 2016-05-22
4 2016-05-22
Name: date, dtype: datetime64[ns]
But I get an error when trying to convert df['purchase'] to an integer:
>>> df['purchase'].astype(int)
....
pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()
TypeError: long() argument must be a string or a number, not 'java.lang.Long'
NOTE: I get a similar error when I try .astype('float').
And when trying to convert to a string, nothing seems to happen.
>>> df['id'].apply(str)
1 abc1
2 abc2
3 abc3
4 abc4
Name: id, dtype: object
Documenting the answer that worked for me based on the comment by @piRSquared.
I needed to convert to a string first, then an integer.
>>> df['purchase'].astype(str).astype(int)
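A minimal sketch of why the two-step cast helps, assuming the column holds wrapper objects that Python's int() cannot read directly but that print as digits. FakeLong is a hypothetical stand-in for the java.lang.Long values in the traceback; it is not part of pandas or of any database driver.

```python
import pandas as pd


class FakeLong:
    """Hypothetical stand-in for a driver-specific wrapper such as java.lang.Long."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        # The wrapper prints as plain digits, which is what makes astype(str) useful.
        return str(self.value)


purchase = pd.Series(
    [FakeLong(1), FakeLong(0), FakeLong(2), FakeLong(0)], dtype=object
)

# Casting directly fails because int() does not know how to read FakeLong.
try:
    purchase.astype(int)
except (TypeError, ValueError) as exc:
    print("direct cast failed:", type(exc).__name__)

# Going through str first calls str() on each element, and the resulting
# digit strings can then be parsed as integers.
converted = purchase.astype(str).astype(int)
print(converted.tolist())
```

This only works when every element's string form is a valid integer literal; missing values need separate handling.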
pandas >= 1.0
convert_dtypes
The (self) accepted answer doesn't take into consideration the possibility of NaNs in object columns.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': [True, False, np.nan]}, dtype=object)
df
a b
0 1 True
1 2 False
2 NaN NaN
df['a'].astype(str).astype(int) # raises ValueError
This chokes because the NaN is converted to the string "nan", and further attempts to coerce to integer will fail. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:
df.convert_dtypes()
a b
0 1 True
1 2 False
2 <NA> <NA>
df.convert_dtypes().dtypes
a Int64
b boolean
dtype: object
If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:
s = pd.Series(['1', '2', '...'])
s.convert_dtypes() # converts to string, which is not what we want
0 1
1 2
2 ...
dtype: string
# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 NaN
dtype: float64
# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes()
0 1
1 2
2 <NA>
dtype: Int64
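The two steps above can be wrapped into one small helper. This is only a sketch: the name to_nullable_int is made up for illustration, and it assumes every value that survives coercion is a whole number (otherwise convert_dtypes would not settle on an integer dtype).

```python
import pandas as pd


def to_nullable_int(series: pd.Series) -> pd.Series:
    """Hypothetical helper: coerce junk to NaN, then soft-convert to a nullable dtype."""
    # Step 1: anything that cannot be parsed as a number becomes NaN (float64).
    numeric = pd.to_numeric(series, errors="coerce")
    # Step 2: let pandas pick the nullable dtype; whole-number floats become Int64.
    return numeric.convert_dtypes()


s = pd.Series(['1', '2', '...'])
result = to_nullable_int(s)
print(result)
print(result.dtype)
```

Keeping the coercion explicit makes it easy to inspect result.isna() afterwards and see which rows held junk.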
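You can use pd.factorize, which encodes each distinct value as an integer code (the codes are labels, not the original numeric values):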
pd.factorize(df.purchase)[0]
Example:
labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
labels
# array([0, 0, 1, 2, 0])
uniques
# array(['b', 'a', 'c'], dtype=object)
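One caveat worth seeing side by side: factorize numbers values by order of first appearance, so on a column of digit strings the codes generally differ from the parsed integers. A small sketch, using a made-up purchase column shaped like the one in the question:

```python
import pandas as pd

# Hypothetical column of digit strings, similar to df['purchase'] in the question.
purchase = pd.Series(['1', '0', '2', '0'], dtype=object)

# factorize assigns 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(purchase)
print(codes.tolist())      # [0, 1, 2, 1]
print(list(uniques))       # ['1', '0', '2']

# Parsing the strings gives the actual values instead.
values = purchase.astype(int)
print(values.tolist())     # [1, 0, 2, 0]
```

So factorize fits when the column is categorical and any integer label will do, while astype or pd.to_numeric fits when the numbers themselves matter.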
My training data contains three features of dtype object. Applying astype converts an object column to numeric, but before that you may need to perform some preprocessing steps:
train.dtypes
C12 object
C13 object
C14 object
train['C14'] = train.C14.astype(int)
train.dtypes
C12 object
C13 object
C14 int32
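The answer does not say which preprocessing steps it has in mind. One plausible reading, sketched here with made-up data, is to find the values that cannot be parsed before calling astype(int), and then decide what to do with them. Dropping those rows is just one assumed choice.

```python
import pandas as pd

# Hypothetical training frame; 'n/a' stands in for whatever junk the real data holds.
train = pd.DataFrame({'C14': ['10', 'n/a', '30']}, dtype=object)

# Values that are present but fail numeric parsing show up as NaN after coercion.
as_number = pd.to_numeric(train['C14'], errors='coerce')
bad_rows = train[as_number.isna() & train['C14'].notna()]
print(bad_rows)

# Assumed cleanup policy: drop the unparseable rows, then cast the remainder.
clean = train.drop(bad_rows.index)
clean['C14'] = clean['C14'].astype(int)
print(clean['C14'].tolist())
```

If dropping rows is not acceptable, the nullable Int64 route shown in the convert_dtypes answer keeps the rows and marks the bad values as missing.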