How to set dtypes by column in pandas DataFrame
I just ran into this, and the pandas issue is still open, so I'm posting my workaround. Assuming df is my DataFrame and dtype is a dict mapping column names to types:
for k, v in dtype.items():
df[k] = df[k].astype(v)
(note: use dtype.iteritems() in Python 2)
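For completeness, here is a self-contained sketch of the loop above (the column names and dtypes are made up for illustration). Note that in recent pandas versions, DataFrame.astype also accepts such a dict directly, so the loop is no longer strictly necessary:

```python
import pandas as pd

# Example data: dtypes are inferred at construction time
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.0, 2.0, 3.0]})
dtype = {'a': 'float64', 'b': 'int64'}

# The workaround above: cast each column in turn
for k, v in dtype.items():
    df[k] = df[k].astype(v)

print(df.dtypes)

# Recent pandas versions accept the dict directly in astype:
df2 = pd.DataFrame({'a': [1, 2, 3], 'b': [1.0, 2.0, 3.0]}).astype(dtype)
print(df2.dtypes)
```

Both approaches produce the same result; the dict form of astype casts all the listed columns in one call.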
For reference:
- The list of allowed data types (NumPy dtypes): https://docs.scipy.org/doc/numpy-1.12.0/reference/arrays.dtypes.html
- Pandas also supports some other types, e.g. category: http://pandas.pydata.org/pandas-docs/stable/categorical.html
- The relevant GitHub issue: https://github.com/pandas-dev/pandas/issues/9287
As of pandas version 0.24.2 (the current stable release) it is not possible to pass an explicit list of datatypes to the DataFrame constructor as the docs state:
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer
However, the DataFrame class does have a classmethod, from_records, that converts a NumPy structured array to a DataFrame, so you can do:
>>> myarray = np.random.randint(0, 5, size=(2, 2))
>>> record = np.array(list(map(tuple, myarray)), dtype=[('a', np.float64), ('b', np.int64)])
>>> mydf = pd.DataFrame.from_records(record)
>>> mydf.dtypes
a    float64
b      int64
dtype: object
(note: in Python 3, map returns an iterator, so it must be wrapped in list before being passed to np.array; np.float and np.int are also deprecated aliases, so the explicit np.float64 and np.int64 are used instead)
You may want to try passing a dictionary of Series objects to the DataFrame constructor - it gives you much more specific control over the creation, and makes it clearer what's going on. A template version (data1 can be an array, etc.):
df = pd.DataFrame({'column1':pd.Series(data1, dtype='type1'),
'column2':pd.Series(data2, dtype='type2')})
And example with data:
df = pd.DataFrame({'A':pd.Series([1,2,3], dtype='int'),
'B':pd.Series([7,8,9], dtype='float')})
print(df)
A B
0 1 7.0
1 2 8.0
2 3 9.0
print(df.dtypes)
A int32
B float64
dtype: object