Is there any way to do an SQL update-where from a dataframe without iterating through each line? I have a PostgreSQL database, and to update a table in the db from a dataframe I would use psycopg2 and do something like:

import psycopg2

con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()

# one UPDATE statement per dataframe row
for index, row in df.iterrows():
    sql = 'update my_table set col1 = %s where col2 = %s'
    cur.execute(sql, (row['whatever'], row['something']))
con.commit()

But on the other hand, if I'm either reading a table from SQL or writing an entire dataframe to SQL (with no update-where), then I would just use pandas and SQLAlchemy. Something like:

from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:pswd@localhost/mydb')
df.to_sql('table', engine, if_exists='append')

It's great just having a 'one-liner' using to_sql. Isn't there something similar to do an update-where from pandas to PostgreSQL? Or is the only way to do it by iterating through each row, as I've done above? Isn't iterating through each row an inefficient way to do it?


Solution 1:

Consider a temp table which would be an exact replica of your final table, cleaned out with each run:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://user:pswd@localhost/mydb')
df.to_sql('temp_table', engine, if_exists='replace', index=False)

sql = """
    UPDATE final_table AS f
    SET col1 = t.col1
    FROM temp_table AS t
    WHERE f.id = t.id
"""

with engine.begin() as conn:     # TRANSACTION: commits on success, rolls back on error
    conn.execute(text(sql))
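engine.begin() runs the UPDATE atomically, so a failure leaves final_table untouched. If you do not want the temp table to linger between runs, you can drop it in the same transaction; a minimal sketch, continuing from the code above:

with engine.begin() as conn:
    conn.execute(text(sql))
    conn.execute(text("DROP TABLE temp_table"))   # clean up the staging copy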

Solution 2:

It looks like you are using some external data stored in df for the conditions on updating your database table. If it is possible, why not just do a one-line SQL update, as sketched below?
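A minimal sketch of such a single UPDATE issued through SQLAlchemy; the table and column names here are placeholders:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://user:pswd@localhost/mydb')

# one statement updates every matching row server-side, with no Python loop
with engine.begin() as conn:
    conn.execute(
        text("UPDATE your_table SET some_col = :new WHERE some_col = :old"),
        {"new": "new value", "old": "old value"},
    )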

If you are working with a smallish database (where loading the whole table into a pandas DataFrame isn't going to kill you), then you can definitely update the DataFrame conditionally after loading it with read_sql. Then you can pass the keyword argument if_exists="replace" to replace the DB table with the updated one.

import pandas

df = pandas.read_sql("select * from your_table;", engine)

# update information (update your_table set column = "new value" where column = "old value")
# still may need to iterate for many old value/new value pairs
df.loc[df['column'] == "old value", "column"] = "new value"

# send data back to sql
df.to_sql("your_table", engine, if_exists="replace", index=False)

Pandas is a powerful tool, and its limited SQL support started out as just a small feature. As time goes by, people are trying to use pandas as their only database interface. I don't think pandas was ever meant to be an end-all for database interaction, but there are a lot of people working on new features all the time. See: https://github.com/pandas-dev/pandas/issues

Solution 3:

I have so far not seen a case where the pandas sql connector can be used in any scalable way to update database data. It may have seemed like a good idea to build one, but really, for operational work it just does not scale.

What I would recommend is to dump your entire dataframe as a CSV using

df.to_csv('filename.csv', encoding='utf-8', index=False)

Then load the CSV into the database using COPY for PostgreSQL or LOAD DATA INFILE for MySQL, as sketched below.
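For PostgreSQL, a minimal sketch using psycopg2's copy_expert; the table name and connection details are placeholders:

import psycopg2

con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()

# stream the CSV into the table; HEADER skips the column-name row that to_csv wrote
with open('filename.csv', 'r', encoding='utf-8') as f:
    cur.copy_expert("COPY your_table FROM STDIN WITH CSV HEADER", f)
con.commit()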

If you do not make other changes to the table in question while the data is being manipulated by pandas, you can just load into the table.

If there are concurrency issues, you will have to load the data into a staging table, from which you then update your primary table.

In the latter case, your primary table needs a datetime column that tells you when it was last modified, so you can determine whether your pandas changes are the latest or whether the database changes should remain. A sketch of such a guarded update follows.
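A minimal sketch of that staging-table update, assuming an SQLAlchemy engine as in Solution 1 and that both tables share an id key and a modified_at timestamp (all names here are illustrative):

from sqlalchemy import text

sql = """
    UPDATE primary_table AS p
    SET col1 = s.col1,
        modified_at = s.modified_at
    FROM staging_table AS s
    WHERE p.id = s.id
      AND s.modified_at > p.modified_at
"""

# only rows where the pandas-side change is newer get applied
with engine.begin() as conn:
    conn.execute(text(sql))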

Solution 4:

I was wondering: why don't you update the df first based on your equation, and then store the df to the database? You could use if_exists='replace' to store it in the same table. A minimal sketch is below.
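A minimal sketch of that idea, with a made-up "equation" (all names are placeholders):

import pandas

df = pandas.read_sql("select * from your_table;", engine)

# apply your equation to the dataframe in memory
df["column"] = df["column"] * 2

# overwrite the same table with the updated dataframe
df.to_sql("your_table", engine, if_exists="replace", index=False)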