Pandas Dataframe: Why is astype method producing int32 results with an argument of int
I am using Python 3.8 and Pandas 1.3. Here is some sample code:
import pandas as pd
data_dc = {'Dates': ['10212021', '11152021', '01142022', '02122022']}
df1 = pd.DataFrame(data_dc)
print(df1['Dates'].astype(int))
Results:
0    10212021
1    11152021
2     1142022
3     2122022
Name: Dates, dtype: int32
I passed a Python data type (int) as the argument to the astype method and expected the dtype of the Dates column to be int64. Instead, I got int32. Is this a bug, or am I doing something wrong? This is easy to work around, but I like to make sure I understand what to expect from the software.
Solution 1:
Pandas uses NumPy data types under the hood. From the NumPy documentation:
The default NumPy behavior is to create arrays in either 32 or 64-bit signed integers (platform dependent and matches C int size) or double precision floating point numbers, int32/int64 and float, respectively. If you expect your integer arrays to be a specific type, then you need to specify the dtype while you create the array.
It is not a bug. You should specify the dtype explicitly if you need a particular width or want your code to be platform agnostic. To rephrase your question: what is np.dtype(int) on my platform?
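A quick way to answer that on your own machine is to ask NumPy directly. This is only a small sketch; the helper name describe_default_int is made up for illustration and is not part of NumPy or pandas.

```python
import numpy as np

def describe_default_int():
    """Return the name and bit width of the dtype NumPy maps Python's int to.

    Hypothetical helper for illustration only.
    """
    dt = np.dtype(int)
    # itemsize is in bytes, so multiply by 8 to get bits
    return dt.name, dt.itemsize * 8

name, bits = describe_default_int()
print(f"np.dtype(int) is {name} ({bits} bits)")
```

The result depends on the platform and on the NumPy version installed, so the printed value may differ between machines.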
On Windows, as some of the comments suggest, it appears to be a C signed long (32 bits). You can even get NumPy to raise an overflow error to confirm this.
>>> import numpy as np
>>> np.array([2_147_483_648], dtype=int)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
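If you want int64 regardless of platform, one option is to name the width explicitly rather than passing Python's int. The sketch below reuses the sample data from the question; the variable names dates_int64 and big are illustrative only.

```python
import numpy as np
import pandas as pd

data_dc = {'Dates': ['10212021', '11152021', '01142022', '02122022']}
df1 = pd.DataFrame(data_dc)

# Ask for a fixed-width type instead of the platform-dependent int
dates_int64 = df1['Dates'].astype('int64')
print(dates_int64.dtype)

# With an explicit 64-bit dtype, 2**31 fits without an overflow error
big = np.array([2_147_483_648], dtype=np.int64)
print(big.dtype, big[0])
```

Passing np.int64 instead of the string 'int64' to astype should behave the same way.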