Why can pandas DataFrames change each other?

I'm trying to keep a copy of a pandas DataFrame so that I can modify it while preserving the original. But when I modify the copy, the original DataFrame changes too. For example:

import pandas as pd
df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})
df1

    col1    col2
    a       1
    b       2
    c       3
    d       4

df2 = df1
df2['col2'] = df2['col2'] + 1
df1

    col1    col2
    a       2
    b       3
    c       4
    d       5

I set df2 equal to df1, and when I modified df2, df1 changed as well. Why does this happen, and is there a way to keep a "backup" of a pandas DataFrame that is not modified along with it?

This goes much deeper than DataFrames: you are thinking about Python variables the wrong way. Python variables are pointers, not buckets. That is to say, when you write

>>> y = [1, 2, 3]

you are not putting [1, 2, 3] into a bucket called y; rather, you are creating a pointer named y which points to [1, 2, 3].

When you then write

>>> x = y

you are not putting the contents of y into a bucket called x; you are creating a pointer named x which points to the same thing that y points to. Thus:

>>> x[1] = 100
>>> print(y)
[1, 100, 3]

Because x and y point to the same object, modifying it via one pointer modifies it for the other pointer as well. If you'd like to point to a copy instead, you need to explicitly create a copy. With lists you can do it like this:

>>> y = [1, 2, 3]
>>> x = y[:]
>>> x[1] = 100
>>> print(y)
[1, 2, 3]
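
If you want to check whether two names point to the same object, the `is` operator can help. The following is only a small sketch; the names alias, clone, nested, shallow and deep are made up for illustration. It also shows a caveat worth knowing: a slice copy is shallow, so inner lists are still shared unless you use copy.deepcopy.

```python
import copy

y = [1, 2, 3]
alias = y      # a second name for the same list object
clone = y[:]   # a new list holding the same elements (shallow copy)

print(alias is y)  # True: both names point to one object
print(clone is y)  # False: clone is a separate list

# A slice copy only copies the outer list; inner lists are still shared.
nested = [[1, 2], [3, 4]]
shallow = nested[:]
deep = copy.deepcopy(nested)  # recursively copies the inner lists too

shallow[0][0] = 100
print(nested)  # [[100, 2], [3, 4]]: changed through the shared inner list
print(deep)    # [[1, 2], [3, 4]]: fully independent
```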

With DataFrames, you can create a copy with the copy() method:

>>> df2 = df1.copy()

You need to make a copy:

df2 = df1.copy()

df2['col2'] = df2['col2'] + 1
print(df1)

Output:

  col1  col2
0    a     1
1    b     2
2    c     3
3    d     4

Writing df2 = df1 does not copy anything; it just creates a second name for the same DataFrame.
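
One way to see the difference between a second name and a real copy is to compare object identity with `is`. This is a minimal sketch using the DataFrame from the question; the names alias and backup are hypothetical and only there for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})

alias = df1          # second name for the same DataFrame object
backup = df1.copy()  # separate DataFrame (copy() is deep by default)

print(alias is df1)   # True
print(backup is df1)  # False

# Modifying through the alias changes df1; the backup is untouched.
alias['col2'] = alias['col2'] + 1
print(df1['col2'].tolist())     # [2, 3, 4, 5]
print(backup['col2'].tolist())  # [1, 2, 3, 4]
```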
