Pandas: convert dtype 'object' to int

I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I am able to convert the date 'object' to a Pandas datetime dtype, but I'm getting an error when trying to convert the string and integers.

Here is an example:

>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
    id    date          purchase
 1  abc1  2016-05-22    1
 2  abc2  2016-05-29    0
 3  abc3  2016-05-22    2
 4  abc4  2016-05-22    0

>>> df.dtypes
 id          object
 date        object
 purchase    object
 dtype: object

Converting the df['date'] to a datetime works:

>>> pd.to_datetime(df['date'])
 1  2016-05-22
 2  2016-05-29
 3  2016-05-22
 4  2016-05-22
 Name: date, dtype: datetime64[ns] 

But I get an error when trying to convert the df['purchase'] to an integer:

>>> df['purchase'].astype(int)
 ....
 pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
 pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()

 TypeError: long() argument must be a string or a number, not 'java.lang.Long'

NOTE: I get a similar error when I tried .astype('float')

And when trying to convert to a string, nothing seems to happen.

>>> df['id'].apply(str)
 1 abc1
 2 abc2
 3 abc3
 4 abc4
 Name: id, dtype: object

Documenting the answer that worked for me based on the comment by @piRSquared.

I needed to convert to a string first, then an integer.

>>> df['purchase'].astype(str).astype(int)

pandas >= 1.0

convert_dtypes

The (self) accepted answer doesn't take into consideration the possibility of NaNs in object columns.

df = pd.DataFrame({
     'a': [1, 2, np.nan], 
     'b': [True, False, np.nan]}, dtype=object) 
df                                                                         

     a      b
0    1   True
1    2  False
2  NaN    NaN

df['a'].astype(str).astype(int) # raises ValueError

This chokes because the NaN is converted to a string "nan", and further attempts to coerce to integer will fail. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:

df.convert_dtypes()                                                        

      a      b
0     1   True
1     2  False
2  <NA>   <NA>

df.convert_dtypes().dtypes                                                 

a      Int64
b    boolean
dtype: object

If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:

s = pd.Series(['1', '2', '...'])
s.convert_dtypes()  # converts to string, which is not what we want

0      1
1      2
2    ...
dtype: string 

# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    NaN
dtype: float64

# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes() 

0       1
1       2
2    <NA>
dtype: Int64

It's simple

pd.factorize(df.purchase)[0]

Example:

labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])`
labels
# array([0, 0, 1, 2, 0])
uniques
# array(['b', 'a', 'c'], dtype=object)

My train data contains three features are object after applying astype it converts the object into numeric but before that, you need to perform some preprocessing steps:

train.dtypes

C12       object
C13       object
C14       Object

train['C14'] = train.C14.astype(int)

train.dtypes

C12       object
C13       object
C14       int32