How to change values in a given column based on a condition and put those new values in a new column? [duplicate]
I have a dictionary which looks like this: di = {1: "A", 2: "B"}
I would like to apply it to the col1 column of a dataframe similar to:
col1 col2
0 w a
1 1 2
2 2 NaN
to get:
col1 col2
0 w a
1 A 2
2 B NaN
How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/
Solution 1:
You can use .replace. For example:
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
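or directly on the Series, i.e. df["col1"].replace(di, inplace=True). With Copy-on-Write (the default from pandas 3.0), an in-place call on a selected column no longer updates df, so assigning the result back, df["col1"] = df["col1"].replace(di), is the safer form.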
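As a minimal sketch of the assignment form, the snippet below writes the remapped values into a new column instead of overwriting col1. The column name col1_mapped is a hypothetical name chosen for illustration; it is not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["w", 1, 2], "col2": ["a", 2, np.nan]})
di = {1: "A", 2: "B"}

# replace returns a new Series; values with no key in di are left untouched.
# "col1_mapped" is a hypothetical column name used only for this sketch.
df["col1_mapped"] = df["col1"].replace(di)

print(df)
```

Because replace returns a new object here, the original col1 stays available for comparison.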
Solution 2:
map can be much faster than replace
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or series: Documentation for pandas.Series.map
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing values for non-matches, you can add fillna:
df['col1'].map(di).fillna(df['col1'])
as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
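To make the difference between the two forms concrete, here is a small sketch that reuses the question's data. The variable names mapped_only and mapped_keep are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["w", 1, 2], "col2": ["a", 2, np.nan]})
di = {1: "A", 2: "B"}

# Exhaustive form: 'w' has no key in di, so it becomes NaN.
mapped_only = df["col1"].map(di)

# Non-exhaustive form: where map produced NaN, fall back to the original value.
mapped_keep = df["col1"].map(di).fillna(df["col1"])

print(mapped_only.tolist())
print(mapped_keep.tolist())  # ['w', 'A', 'B']
```

One caveat with this pattern: fillna cannot tell a non-match from a key that the dictionary deliberately maps to NaN, so such entries would also be restored to their original values.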
Benchmarks
Using the following data with pandas version 0.23.1:
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
and testing with %timeit, it appears that map is approximately 10x faster than replace.
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
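If you want to check the comparison on your own machine outside IPython, a rough sketch with the standard-library timeit module might look like the following. The fixed seed and the repeat count of 5 are assumptions made for repeatability. The timings it prints depend on your pandas version and hardware, so it may not reproduce the figure quoted above.

```python
import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
df = pd.DataFrame({"col1": rng.integers(1, 9, size=100_000)})

# For an exhaustive mapping, both approaches should yield the same values.
same = df["col1"].map(di).tolist() == df["col1"].replace(di).tolist()

# Time each approach over a handful of runs (5 is an arbitrary choice).
t_map = timeit.timeit(lambda: df["col1"].map(di), number=5)
t_replace = timeit.timeit(lambda: df["col1"].replace(di), number=5)

print(f"map: {t_map:.4f}s  replace: {t_replace:.4f}s  identical: {same}")
```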
Solution 3:
There is a bit of ambiguity in your question. There are at least three interpretations:
- the keys in di refer to index values
- the keys in di refer to df['col1'] values
- the keys in di refer to index locations (not the OP's question, but thrown in for fun.)
Below is a solution for each case.
Case 1:
If the keys of di are meant to refer to index values, then you could use the update method:
df['col1'].update(pd.Series(di))
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
yields
col1 col2
1 w a
2 B 30
0 A NaN
I've modified the values from your original post so it is clearer what update is doing.
Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.
Case 2:
If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'] = df['col1'].replace(di)
print(df)
yields
col1 col2
1 w a
2 A 30
0 B NaN
Note how in this case the keys in di were changed to match values in df['col1'].
Case 3:
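If the keys in di refer to index locations, then older pandas versions let you use
df['col1'].put(di.keys(), di.values())
Series.put was removed in pandas 1.0, so on current versions positional assignment with iloc does the same job:
df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
since
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
print(df)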
yields
col1 col2
1 A a
2 10 30
0 B NaN
Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
Solution 4:
DSM has the accepted answer, but the code doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)
print(df.head())
You'll see it looks like:
col1 col2 converted_column
0 1 negative -1
1 2 positive 1
2 2 neutral 0
3 3 neutral 0
4 1 positive 1
The docs for pandas.DataFrame.replace are here.