In pandas, is inplace = True considered harmful, or not?
This has been discussed before, but with conflicting answers:
- in-place is good!
- in-place is bad!
What I'm wondering is:
- Why is inplace = False the default behavior?
- When is it good to change it? (Well, I'm allowed to change it, so I guess there's a reason.)
- Is this a safety issue? That is, can an operation fail or misbehave due to inplace = True?
- Can I know in advance if a certain inplace = True operation will "really" be carried out in-place?
My take so far:
- Many Pandas operations have an inplace parameter, always defaulting to False, meaning the original DataFrame is untouched, and the operation returns a new DF.
- When setting inplace = True, the operation might work on the original DF, but it might still work on a copy behind the scenes, and just reassign the reference when done.
pros of inplace = True:
- Can be both faster and less memory hogging (the first link shows reset_index() runs twice as fast and uses half the peak memory!).
pros of inplace = False:
- Allows chained/functional syntax: df.dropna().rename().sum()... which is nice, and offers a chance for lazy evaluation or a more efficient re-ordering (though I don't think Pandas is doing this).
- When using inplace = True on an object which is potentially a slice/view of an underlying DF, Pandas has to do a SettingWithCopy check, which is expensive. inplace = False avoids this.
- Consistent & predictable behavior behind the scenes.
So, putting the copy-vs-view issue aside, it seems more performant to always use inplace = True, unless specifically writing a chained statement. But that's not the default Pandas opts for, so what am I missing?
Solution 1:
In pandas, is inplace = True considered harmful, or not?
Yes, it is. Not just harmful. Quite harmful. This GitHub issue is proposing the inplace argument be deprecated API-wide sometime in the near future. In a nutshell, here's everything wrong with the inplace argument:
- inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
- inplace does not work with method chaining
- inplace can lead to the dreaded SettingWithCopyWarning when called on a DataFrame column, and may sometimes fail to update the column in-place
The pain points above are all common pitfalls for beginners, so removing this option will simplify the API greatly.
Let's take a look at the points above in more depth.
Performance
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In general, there are no performance benefits to using inplace=True (there are rare exceptions, but they are mostly a result of implementation details in the library and should not be used as a crutch to advocate for this argument's usage). Most in-place and out-of-place versions of a method create a copy of the data anyway, with the in-place version automatically assigning the copy back. The copy cannot be avoided.
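To make the "assigns the copy back" point concrete, here is a minimal sketch comparing the two spellings. It only shows that they end up with the same result; it does not measure time or memory, and what happens internally depends on the method and the pandas version. The variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Out-of-place: returns a new DataFrame and leaves df untouched.
sorted_df = df.sort_values("a")

# In-place: mutates the object it is called on.
# We work on a copy here so df itself stays available for comparison.
other = df.copy()
other.sort_values("a", inplace=True)

print(sorted_df.equals(other))  # both spellings give the same frame
print(df["a"].tolist())         # the original is still in insertion order
```

From the caller's point of view, the in-place call behaves like `other = other.sort_values("a")`, just with the reassignment hidden.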
Method Chaining
inplace=True also hinders method chaining. Contrast the working of
result = df.some_function1().reset_index().some_function2()
As opposed to
temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
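The snippets above use placeholder method names. As a runnable sketch with real methods (the choice of dropna, sort_values and reset_index, and the sample data, are my own assumptions for illustration), the two styles look like this:

```python
import pandas as pd

df = pd.DataFrame({"a": [3.0, None, 1.0], "b": ["x", "y", "z"]})

# Chained: every call returns a new DataFrame, so the steps compose.
chained = df.dropna().sort_values("a").reset_index(drop=True)

# With inplace=True each step needs its own statement, because an
# in-place call does not hand back a frame to keep chaining on.
temp = df.dropna()
temp.sort_values("a", inplace=True)
temp.reset_index(drop=True, inplace=True)

print(chained.equals(temp))
print(chained["b"].tolist())
```

Both versions produce the same frame; the chained one just reads as a single pipeline.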
Unintended Pitfalls
One final caveat to keep in mind is that calling a method with inplace=True can trigger the SettingWithCopyWarning:
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]  # df2 may be a view or a copy of df
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
Which can cause unexpected behavior.
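One way to sidestep this, sketched here under the assumption that you want df2 to be independent of df, is to take an explicit copy and assign the result back instead of mutating the column in place:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# An explicit copy makes df2 unambiguously its own object.
df2 = df[df["a"] > 1].copy()

# Assign the replaced column back rather than using inplace=True.
df2["b"] = df2["b"].replace({"x": "abc"})

print(df2["b"].tolist())  # ['abc', 'y']
print(df["b"].tolist())   # ['x', 'y', 'z'], the original is unchanged
```

The assignment makes it obvious which object is being updated, which is exactly what the in-place call on a column leaves unclear.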
Solution 2:
If inplace was the default then the DataFrame would be mutated for all names that currently reference it.
A simple example, say I have a df:
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
Now it's very important that the DataFrame retains that row order - let's say it's from a data source where insertion order is key, for instance.
However, I now need to do some operations which require a different sort order:
def f(frame):
    df = frame.sort_values('a')
    # if we did frame.sort_values('a', inplace=True) here without
    # making it explicit - our caller is going to wonder what happened
    # do something
    return df
That's fine - my original df remains the same. However, if inplace=True were the default then my original df would now be sorted as a side-effect of f(), and I'd have to trust the author of f() to remember not to do something in place that I'm not expecting, instead of deliberately choosing to do something in place... So it's better that anything that can mutate an object in place does so explicitly, to at least make it more obvious what's happened and why.
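Here is a small sketch of that side effect with an actual DataFrame. The names f_copy and f_inplace are hypothetical, made up for this illustration only:

```python
import pandas as pd

def f_copy(frame):
    # Default behaviour: sort a new frame, leave the caller's object alone.
    return frame.sort_values("a")

def f_inplace(frame):
    # Explicit opt-in: this reorders the very object the caller passed in.
    frame.sort_values("a", inplace=True)
    return frame

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

f_copy(df)
order_after_copy = df["a"].tolist()     # [3, 2, 1], insertion order kept

f_inplace(df)
order_after_inplace = df["a"].tolist()  # [1, 2, 3], caller's df was sorted

print(order_after_copy, order_after_inplace)
```

With the current default you only get the second behaviour if someone types inplace=True, so the mutation is at least visible in the code.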
Even with basic Python builtin mutables, you can observe this:
data = [3, 2, 1]

def f(lst):
    lst.sort()
    # I meant lst = sorted(lst)
    for item in lst:
        print(item)

f(data)
for item in data:
    print(item)
# huh!? What happened to my data - why's it not 3, 2, 1?