Why can pandas DataFrames change each other?
I'm trying to keep a copy of a pandas DataFrame so that I can modify it while preserving the original. But when I modify the copy, the original DataFrame changes too. For example:
df1=pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df1
col1 col2
a 1
b 2
c 3
d 4
df2=df1
df2['col2']=df2['col2']+1
df1
col1 col2
a 2
b 3
c 4
d 5
I set df2 equal to df1, then when I modified df2, df1 also changed. Why is this, and is there any way to save a "backup" of a pandas DataFrame without it being modified?
This is much deeper than dataframes: you are thinking about Python variables the wrong way. Python variables are pointers, not buckets. That is to say, when you write
>>> y = [1, 2, 3]
you are not putting [1, 2, 3] into a bucket called y; rather, you are creating a pointer named y which points to [1, 2, 3].
When you then write
>>> x = y
you are not putting the contents of y into a bucket called x; you are creating a pointer named x which points to the same thing that y points to. Thus:
>>> x[1] = 100
>>> print(y)
[1, 100, 3]
Because x and y point to the same object, modifying it via one pointer modifies it for the other pointer as well. If you'd like to point to a copy instead, you need to explicitly create a copy. With lists you can do it like this:
>>> y = [1, 2, 3]
>>> x = y[:]
>>> x[1] = 100
>>> print(y)
[1, 2, 3]
With DataFrames, you can create a copy with the copy() method:
>>> df2 = df1.copy()
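To see the difference between a second name and a real copy side by side, here is a small sketch using the DataFrame from the question (alias and backup are illustrative names, not anything pandas requires):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})

alias = df1            # a second name for the same DataFrame
backup = df1.copy()    # an independent copy (deep=True is the default)

# Modify the DataFrame through its original name
df1['col2'] = df1['col2'] + 1

print(alias is df1)               # True: the alias is the very same object
print(alias['col2'].tolist())     # [2, 3, 4, 5]: the change is visible via the alias
print(backup['col2'].tolist())    # [1, 2, 3, 4]: the backup keeps the original values
```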
You need to make a copy:
df2 = df1.copy()
df2['col2'] = df2['col2'] + 1
print(df1)
Output:
col1 col2
0 a 1
1 b 2
2 c 3
3 d 4
With df2 = df1, you just create a second name for df1.