Inserting non-duplicate rows into a table without dropping the table

I want to add a dataframe to my DB that contains some duplicate rows. Because my real database is very big, I don't want to drop the existing table and recreate it with the updated rows, as that would require me to pull everything again from an API.

I think the correct approach could be to pd.read_sql() the existing table, compare the result with df2, and then only insert the rows that don't already exist (rough sketch after the setup code below).

import sqlalchemy as db
import pandas as pd

engine = db.create_engine('sqlite:///test.db', echo=True)

data1 = {'Month': ['June', 'July', 'August'],
         'Number': [20, 21, 19]}
df = pd.DataFrame(data1)
df.to_sql("ExampleTable", engine, if_exists="append", index=False)

data2 = {'Month': ['May', 'June', 'July', 'August', 'September'],
         'Number': [11, 20, 21, 19, 14]}
df2 = pd.DataFrame(data2)
df2.to_sql("ExampleTable", engine, if_exists="append", index=False)
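
Something like this is what I had in mind for the first step, though I'm not sure it's the right way to do the comparison (using pd.read_sql to pull the table back into a dataframe is just a guess on my part):

existing = pd.read_sql("ExampleTable", engine)  # read the current table back into a dataframe
# ...then keep only the rows of df2 that aren't already in existing, and append those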

Sounds like this problem boils down to finding the rows in a new dataframe, df2, that are not in the original dataframe, df, so that the rows unique to df2 can be appended to the existing SQL table. A similar dataframe comparison scenario has been discussed in Stack Overflow posts here and here. If you only need to compare across the 'Month' column, then this should work:

df2_month_not_in_df = df2[~df2['Month'].isin(df['Month'])]
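
and the filtered rows can then be appended with the same to_sql call as in the question:

df2_month_not_in_df.to_sql("ExampleTable", engine, if_exists="append", index=False)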

But that assumes there are no null 'Month' values to worry about, no need to compare on other columns like year, and so on. A more rigorous approach that compares dataframe rows across multiple columns, possibly with null values, can be implemented following the answer by @toecsnar42 and others in the SO links above:

df_str_tuples = df.astype(str).apply(tuple, axis=1)    # one tuple per row; NaN becomes 'nan'
df2_str_tuples = df2.astype(str).apply(tuple, axis=1)
df2_rows_in_df_filter = df2_str_tuples.isin(df_str_tuples)  # True where a df2 row matches a df row
df2_rows_not_in_df = df2[~df2_rows_in_df_filter]

For each dataframe, convert all values to a string type, which changes missing values to 'nan', then create a series by combining the values from each column into a single tuple for each row. Next, create a mask that is true whenever df2 has a row that matches a row in df. Finally, filter the df2 dataframe to only the rows where the mask is false (i.e., not in df), which builds the dataframe, df2_rows_not_in_df, that can be appended to the existing SQL table using to_sql with if_exists="append". Other implementation approaches from past posts use pd.merge or pd.concat instead, so there are different options to consider depending on your use case.
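
For example, a merge-based version of the same filter might look like this (just a sketch using merge's indicator flag; it assumes df and df2 share the same columns and that duplicate rows within df don't matter):

# Left-merge df2 against df on all shared columns; '_merge' marks each row
# as 'both' (already in df) or 'left_only' (new in df2).
merged = df2.merge(df, how='left', indicator=True)
df2_rows_not_in_df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
df2_rows_not_in_df.to_sql("ExampleTable", engine, if_exists="append", index=False)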


So, just to clarify some things:

  1. SQLite does NOT store rows in order based on their data; it stores them based on ROWID. However, you can retrieve data and have SQLite order the returned rows by one or more columns using an ORDER BY clause in the SQL query.

  2. You're asking to insert data and then modify or update that data in a second request. Dataframes don't really offer this type of functionality. From what I understand, pandas can either insert into an already existing table or drop the table and insert the new data, so this is something you will most likely have to run raw SQL queries to achieve. This post shows someone doing what is called an "upsert" in SQL with pandas, but it required creating a temp table and using that table to modify data in the original table (rough sketch below): How do I perform an UPDATE of existing rows of a db table using a Pandas DataFrame?
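
A rough sketch of that temp-table idea against this question's schema (the StagingTable name is made up, 'Month' is assumed to be what makes a row unique, and this only inserts missing rows rather than updating existing ones):

# Stage the incoming dataframe in a throwaway table, then let SQLite
# copy across only the months that aren't already in ExampleTable.
df2.to_sql("StagingTable", engine, if_exists="replace", index=False)

with engine.begin() as conn:
    conn.execute(db.text("""
        INSERT INTO ExampleTable (Month, Number)
        SELECT s.Month, s.Number
        FROM StagingTable s
        WHERE NOT EXISTS (
            SELECT 1 FROM ExampleTable e WHERE e.Month = s.Month
        )
    """))

A true upsert would also UPDATE the matching rows, which SQLite supports via INSERT ... ON CONFLICT, but that requires a unique constraint on the target table.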