Comparing two pandas dataframes for differences
I've got a script updating 5-10 columns worth of data , but sometimes the start csv will be identical to the end csv so instead of writing an identical csvfile I want it to do nothing...
How can I compare two dataframes to check if they're the same or not?
csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata
# ... do stuff with csvdata dataframe
if csvdata_old != csvdata:
csvdata.to_csv('csvfile.csv', index=False)
Any ideas?
Solution 1:
You also need to be careful to create a copy of the DataFrame, otherwise the csvdata_old will be updated with csvdata (since it points to the same object):
csvdata_old = csvdata.copy()
To check whether they are equal, you can use assert_frame_equal as in this answer:
from pandas.util.testing import assert_frame_equal
assert_frame_equal(csvdata, csvdata_old)
You can wrap this in a function with something like:
try:
assert_frame_equal(csvdata, csvdata_old)
return True
except: # appeantly AssertionError doesn't catch all
return False
There was discussion of a better way...
Solution 2:
Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.
def get_different_rows(source_df, new_df):
"""Returns just the rows from the new dataframe that differ from the source dataframe"""
merged_df = source_df.merge(new_df, indicator=True, how='outer')
changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
return changed_rows_df.drop('_merge', axis=1)
Solution 3:
Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.
Solution 4:
Check using: df_1.equals(df_2) # Returns True or False, details herebelow
In [45]: import numpy as np
In [46]: import pandas as pd
In [47]: np.random.seed(5)
In [48]: df_1= pd.DataFrame(np.random.randn(3,3))
In [49]: df_1
Out[49]:
0 1 2
0 0.441227 -0.330870 2.430771
1 -0.252092 0.109610 1.582481
2 -0.909232 -0.591637 0.187603
In [50]: np.random.seed(5)
In [51]: df_2= pd.DataFrame(np.random.randn(3,3))
In [52]: df_2
Out[52]:
0 1 2
0 0.441227 -0.330870 2.430771
1 -0.252092 0.109610 1.582481
2 -0.909232 -0.591637 0.187603
In [53]: df_1.equals(df_2)
Out[53]: True
In [54]: df_3= pd.DataFrame(np.random.randn(3,3))
In [55]: df_3
Out[55]:
0 1 2
0 -0.329870 -1.192765 -0.204877
1 -0.358829 0.603472 -1.664789
2 -0.700179 1.151391 1.857331
In [56]: df_1.equals(df_3)
Out[56]: False
Solution 5:
A more accurate comparison should check for index names separately, because DataFrame.equals
does not test for that. All the other properties (index values (single/multiindex), values, columns, dtypes) are checked by it correctly.
df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')
df1.equals(df2)
True
df1.index.names == df2.index.names
False
Note: using index.names
instead of index.name
makes it work for multi-indexed dataframes as well.