Factorize a column of strings in pandas
I have a data frame, df_original, which is quite large but looks like this:
ID Count Column 2 Column 3 Column 4
RowX 1 234. 255. yes. 452
RowY 1 123. 135. no. 342
RowW 1 234. 235. yes. 645
RowJ 1 123. 115. no. 342
RowA 1 234. 285. yes. 233
RowR 1 123. 165. no. 342
RowX 2 234. 255. yes. 234
RowY 2 123. 135. yes. 342
RowW 2 234. 235. yes. 233
RowJ 2 123. 115. yes. 342
RowA 2 234. 285. yes. 312
RowR 2 123. 165. no. 342
.
.
.
RowX 1233 234. 255. yes. 133
RowY 1233 123. 135. no. 342
RowW 1233 234. 235. no. 253
RowJ 1233 123. 115. yes. 342
RowA 1233 234. 285. yes. 645
RowR 1233 123. 165. no. 342
I am trying to get rid of the text data and replace it with a predefined numerical equivalent. For example, in this case, I'd like to replace the yes or no values in Column 3 with 1 or 0 respectively. Is there a way to do this without having to go in and alter the values manually?
series
RowX yes
RowY no
RowW yes
RowJ no
RowA yes
RowR no
RowX yes
RowY yes
RowW yes
RowJ yes
RowA yes
RowR no
Name: Column 3, dtype: object
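To reproduce the examples that follow, the series can be rebuilt from the twelve rows shown above. This is only a sketch: the labels and values are copied from that output, but how the real df_original was loaded is not stated, so the construction itself is an assumption.

```python
import pandas as pd

# Index labels and values copied from the twelve rows displayed above.
index = ['RowX', 'RowY', 'RowW', 'RowJ', 'RowA', 'RowR'] * 2
values = ['yes', 'no', 'yes', 'no', 'yes', 'no',
          'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# Assumed construction; in practice this would be df_original['Column 3'].
series = pd.Series(values, index=index, name='Column 3')
```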
pd.factorize
1 - series.factorize()[0]
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
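Note that factorize numbers the labels in order of first appearance, so the 1 - ... trick works here only because 'yes' happens to come first in the data. A small sketch of that behaviour, using a made-up series (s is hypothetical, not the question's data):

```python
import pandas as pd

s = pd.Series(['no', 'yes', 'yes', 'no'])  # hypothetical data; 'no' comes first

# factorize returns the integer codes and the unique labels they refer to.
codes, uniques = pd.factorize(s)
print(codes)          # [0 1 1 0]
print(list(uniques))  # ['no', 'yes']
```

In this case the codes already match the desired mapping, and subtracting them from 1 would invert it, so it is worth checking the uniques before relying on the codes.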
np.where
np.where(series == 'yes', 1, 0)
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
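Since np.where returns a plain array, the result can be assigned straight back to the column. A minimal sketch, assuming a tiny stand-in frame (df is hypothetical; only the column name comes from the question):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df_original; only 'Column 3' matters here.
df = pd.DataFrame({'Column 3': ['yes', 'no', 'yes']})

# Values equal to 'yes' become 1; anything else (including missing values) becomes 0.
df['Column 3'] = np.where(df['Column 3'] == 'yes', 1, 0)
print(df['Column 3'].tolist())  # [1, 0, 1]
```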
pd.Categorical / astype('category')
pd.Categorical(series).codes
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0], dtype=int8)
series.astype('category').cat.codes
RowX 1
RowY 0
RowW 1
RowJ 0
RowA 1
RowR 0
RowX 1
RowY 1
RowW 1
RowJ 1
RowA 1
RowR 0
dtype: int8
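The codes follow the sorted order of the categories, which is why 'no' gets 0 and 'yes' gets 1 here. If the mapping should not depend on alphabetical order, the categories can be passed explicitly. A sketch with made-up labels (s and its values are assumptions for illustration):

```python
import pandas as pd

s = pd.Series(['high', 'low', 'high', 'medium'])  # hypothetical data

# Explicit category order fixes the codes: low -> 0, medium -> 1, high -> 2.
cat = pd.Categorical(s, categories=['low', 'medium', 'high'])
print(cat.codes.tolist())  # [2, 0, 2, 1]
```

Any value that is not listed in categories is given the code -1.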
pd.Series.replace
series.replace({'yes' : 1, 'no' : 0})
RowX 1
RowY 0
RowW 1
RowJ 0
RowA 1
RowR 0
RowX 1
RowY 1
RowW 1
RowJ 1
RowA 1
RowR 0
Name: Column 3, dtype: int64
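A related option is Series.map with the same dictionary. The difference is in how unexpected values are handled: replace leaves them untouched, while map turns them into NaN, which can make stray labels easier to spot. A minimal sketch, where 'maybe' is a hypothetical stray value:

```python
import pandas as pd

s = pd.Series(['yes', 'no', 'maybe'])  # 'maybe' is a hypothetical stray value
mapping = {'yes': 1, 'no': 0}

# Keys found in the dict are translated; everything else becomes NaN.
mapped = s.map(mapping)
print(mapped.isna().tolist())  # [False, False, True]
```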
A fun, generalised version of the above:
series.replace({r'^(?!yes).*$' : 0}, regex=True).astype(bool).astype(int)
RowX 1
RowY 0
RowW 1
RowJ 0
RowA 1
RowR 0
RowX 1
RowY 1
RowW 1
RowJ 1
RowA 1
RowR 0
Name: Column 3, dtype: int64
Anything that does not start with "yes" becomes 0.
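If an exact match on "yes" is wanted instead, a plain comparison avoids the regex altogether. A minimal sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series(['yes', 'no', 'yesterday', None])  # hypothetical values

# Exact comparison: only the literal string 'yes' maps to 1.
exact = s.eq('yes').astype(int)
print(exact.tolist())  # [1, 0, 0, 0]
```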