Comparing two pandas dataframes for differences

I've got a script updating 5-10 columns worth of data , but sometimes the start csv will be identical to the end csv so instead of writing an identical csvfile I want it to do nothing...

How can I compare two dataframes to check if they're the same or not?

csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata

# ... do stuff with csvdata dataframe

if csvdata_old != csvdata:
    csvdata.to_csv('csvfile.csv', index=False)

Any ideas?


Solution 1:

You also need to be careful to create a copy of the DataFrame, otherwise the csvdata_old will be updated with csvdata (since it points to the same object):

csvdata_old = csvdata.copy()

To check whether they are equal, you can use assert_frame_equal as in this answer:

from pandas.util.testing import assert_frame_equal
assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

try:
    assert_frame_equal(csvdata, csvdata_old)
    return True
except:  # appeantly AssertionError doesn't catch all
    return False

There was discussion of a better way...

Solution 2:

Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.

def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)

Solution 3:

Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.

Solution 4:

Check using: df_1.equals(df_2) # Returns True or False, details herebelow

In [45]: import numpy as np

In [46]: import pandas as pd

In [47]: np.random.seed(5)

In [48]: df_1= pd.DataFrame(np.random.randn(3,3))

In [49]: df_1
Out[49]: 
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603

In [50]: np.random.seed(5)

In [51]: df_2= pd.DataFrame(np.random.randn(3,3))

In [52]: df_2
Out[52]: 
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603

In [53]: df_1.equals(df_2)
Out[53]: True


In [54]: df_3= pd.DataFrame(np.random.randn(3,3))

In [55]: df_3
Out[55]: 
          0         1         2
0 -0.329870 -1.192765 -0.204877
1 -0.358829  0.603472 -1.664789
2 -0.700179  1.151391  1.857331

In [56]: df_1.equals(df_3)
Out[56]: False

Solution 5:

A more accurate comparison should check for index names separately, because DataFrame.equals does not test for that. All the other properties (index values (single/multiindex), values, columns, dtypes) are checked by it correctly.

df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')

df1.equals(df2)
True

df1.index.names == df2.index.names
False

Note: using index.names instead of index.name makes it work for multi-indexed dataframes as well.