Remap values in pandas column with a dict, preserve NaNs
Solution 1:
You can use .replace. For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN
or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
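For completeness, here is a minimal sketch of the assignment form of the same operation, reusing df and di from the example above; it avoids the inplace keyword entirely:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}

# Replace on the Series and assign the result back to the column
df["col1"] = df["col1"].replace(di)
print(df["col1"])
# 0    w
# 1    A
# 2    B
# Name: col1, dtype: object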
Solution 2:
map can be much faster than replace
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or Series: see the documentation for pandas.Series.map.
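To make the NaN behaviour concrete, here is a small sketch on a standalone Series holding the same values as col1 above; the value 'w' has no key in the dictionary, so it comes back as NaN:
import pandas as pd

s = pd.Series(['w', 1, 2])
di = {1: "A", 2: "B"}

# Every value is looked up in di; entries without a matching key become NaN
print(s.map(di))
# 0    NaN
# 1      A
# 2      B
# dtype: object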
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing values for non-matches, you can add fillna:
df['col1'].map(di).fillna(df['col1'])
as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
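A minimal, self-contained sketch of this pattern (small made-up column, same shape as the earlier example):
import pandas as pd

df = pd.DataFrame({'col1': ['w', 1, 2]})
di = {1: "A", 2: "B"}

# map() turns the unmatched 'w' into NaN; fillna() then restores the
# original value wherever map() produced a NaN
df['col1'] = df['col1'].map(di).fillna(df['col1'])
print(df['col1'])
# 0    w
# 1    A
# 2    B
# Name: col1, dtype: object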
Benchmarks
Using the following data with pandas version 0.23.1:
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
and testing with %timeit, it appears that map is approximately 10x faster than replace.
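If you are not in IPython, a rough, self-contained way to run a comparable comparison is with the standard timeit module (the data below matches the setup above; the absolute numbers, and the exact speedup, will depend on your pandas version and hardware):
import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
df = pd.DataFrame({'col1': np.random.choice(range(1, 9), 100000)})

# Time 100 runs of each approach on the same column
t_map = timeit.timeit(lambda: df['col1'].map(di), number=100)
t_replace = timeit.timeit(lambda: df['col1'].replace(di), number=100)
print(f"map:     {t_map:.3f} s")
print(f"replace: {t_replace:.3f} s")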
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.