converting currency with $ to numbers in Python pandas
I have the following data in pandas dataframe:
state 1st 2nd 3rd
0 California $11,593,820 $109,264,246 $8,496,273
1 New York $10,861,680 $45,336,041 $6,317,300
2 Florida $7,942,848 $69,369,589 $4,697,244
3 Texas $7,536,817 $61,830,712 $5,736,941
I want to perform some simple analysis (e.g., sum, groupby) with three columns (1st, 2nd, 3rd), but the data type of those three columns is object (or string).
So I used the following code for data conversion:
data = data.convert_objects(convert_numeric=True)
But, conversion does not work, perhaps, due to the dollar sign. Any suggestion?
Solution 1:
@EdChum's answer is clever and works well. But since there's more than one way to bake a cake.... why not use regex? For example:
df[df.columns[1:]] = df[df.columns[1:]].replace('[\$,]', '', regex=True).astype(float)
To me, that is a little bit more readable.
Solution 2:
You can also use locale
as follows
import locale
import pandas as pd
locale.setlocale(locale.LC_ALL,'')
df['1st']=df.1st.map(lambda x: locale.atof(x.strip('$')))
Note the above code was tested in Python 3 and Windows environment
Solution 3:
You can use the vectorised str
methods to replace the unwanted characters and then cast the type to int:
In [81]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str.replace('$','')).apply(lambda x: x.str.replace(',','')).astype(np.int64)
df
Out[81]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
dtype
change is now confirmed:
In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
state 4 non-null object
1st 4 non-null int64
2nd 4 non-null int64
3rd 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 160.0+ bytes
Another way:
In [108]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str[1:].str.split(',').str.join('')).astype(np.int64)
df
Out[108]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
Solution 4:
To convert into integer, use:
carSales["Price"] = carSales["Price"].replace("[$,]", "", regex=True).astype(int)
Solution 5:
You can use the methodstr.replace
and the regex '\D'
to remove all nondigit characters or '[^.0-9]'
to keep digits and decimal points:
for col in df.columns[1:]:
df[col] = pd.to_numeric(df[col].str.replace('[^.0-9]', ''))