How do I perform an UPDATE of existing rows of a db table using a Pandas DataFrame?

Question

I am attempting to query a subset of a MySQL database table, feed the results into a Pandas DataFrame, alter some data, and then write the updated rows back to the same table. The table has ~1 million rows, and the number of rows I will be altering is relatively small (<50,000), so bringing back the entire table and performing a df.to_sql(tablename, engine, if_exists='replace') isn't a viable option. Is there a straightforward way to UPDATE the rows that have been altered without iterating over every row in the DataFrame?

I am aware of this project, which attempts to simulate an "upsert" workflow, but it seems it only accomplishes the task of inserting new non-duplicate rows rather than updating parts of existing rows:

GitHub Pandas-to_sql-upsert

Here is a skeleton of what I'm attempting to accomplish on a much larger scale:

import pandas as pd
from sqlalchemy import create_engine, text

# Get sample data
d = {'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]}
df = pd.DataFrame(d)

# SQLALCHEMY_DATABASE_URI is a placeholder for your MySQL connection string
engine = create_engine(SQLALCHEMY_DATABASE_URI)

# Create a table with a primary key (and therefore a unique constraint) on A.
with engine.begin() as conn:
    conn.execute(text("DROP TABLE IF EXISTS test_upsert"))
    conn.execute(text("""CREATE TABLE test_upsert (
                      A INTEGER,
                      B INTEGER,
                      PRIMARY KEY (A))
                      """))

# Insert data using pandas.to_sql
df.to_sql('test_upsert', engine, if_exists='append', index=False)

# Read the rows back from the db (step implied by the name `df_in_db` below)
df_in_db = pd.read_sql_table('test_upsert', engine)

# Alter row where 'A' == 2
df_in_db.loc[df_in_db['A'] == 2, 'B'] = 6

Now I would like to write df_in_db back to my 'test_upsert' table with the updated data reflected.

This SO question is very similar, and one of the comments recommends using an "sqlalchemy table class" to perform the task.

Update table using sqlalchemy table class

Can anyone expand on how I would implement this for my specific case above if that is the best (only?) way to implement it?

Recommended answer

I think the easiest way would be to:

First delete the rows that are going to be "upserted", then insert their new versions. The delete can be done in a loop, but that is not very efficient for bigger data sets (5K+ rows), so I'd save this slice of the DF into a temporary MySQL table and delete with a single statement:

from sqlalchemy import text

# assuming we have already changed values in the rows and saved those changed rows in a separate DF: `x`
x = df[mask]  # `mask` is a boolean mask that selects the changed rows

# make sure `x` DF has the Primary Key column as index
x = x.set_index('A')

# dump the slice with changed rows to a temporary (staging) MySQL table
x.to_sql('my_tmp', engine, if_exists='replace', index=True)

conn = engine.connect()
trans = conn.begin()

try:
    # delete those rows that we are going to "upsert"
    conn.execute(text('DELETE FROM test_upsert WHERE A IN (SELECT A FROM my_tmp)'))

    # insert changed rows on the same connection, so both steps share one transaction
    x.to_sql('test_upsert', conn, if_exists='append', index=True)
    trans.commit()
except Exception:
    trans.rollback()
    raise
finally:
    conn.close()
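
The snippet above leaves `mask` undefined. One possible way to build it, assuming an unmodified copy of the rows as they were read from the table is still available, is to compare the two frames element-wise. This is only a sketch: the names `original` and `modified` are illustrative and do not come from the answer, and both frames must share the same index and columns.

```python
import pandas as pd

# Hypothetical data: `original` is what was read from the table,
# `modified` is the same frame after some values were edited.
original = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]})
modified = original.copy()
modified.loc[modified['A'] == 2, 'B'] = 6

# True for every row in which at least one column differs.
# Caveat: NaN != NaN, so rows containing NaN are always flagged as changed.
mask = (original != modified).any(axis=1)

# Only the changed rows; this is the slice to write to the staging table.
x = modified[mask]
print(x)
```

Keeping the comparison vectorised avoids iterating over the DataFrame row by row, which is what the question asks for.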

P.S. I didn't test this code, so it might have some small bugs, but it should give you the idea.
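
For readers who want to try the staging-table idea without a MySQL server, here is a small self-contained sketch against an in-memory SQLite database. It is a variation, not the answer's exact method: instead of delete-then-insert it issues a single UPDATE that pulls the new values from the staging table, so only column `B` of the matching rows is touched. The table names mirror the ones used above; the use of SQLite and of a correlated subquery are assumptions made for the sake of a runnable example. On MySQL the same idea would typically be written as a multi-table UPDATE with a JOIN, which is not shown or tested here.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL database (assumption for the demo).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE test_upsert (A INTEGER PRIMARY KEY, B INTEGER)')

# Initial contents of the target table.
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]})
df.to_sql('test_upsert', conn, if_exists='append', index=False)

# The changed rows only (here: the row with A == 2 gets B = 6).
x = pd.DataFrame({'A': [2], 'B': [6]})

# Write the changed rows to a staging table.
x.to_sql('my_tmp', conn, if_exists='replace', index=False)

# One set-based UPDATE; `with conn` commits on success, rolls back on error.
with conn:
    conn.execute("""
        UPDATE test_upsert
        SET B = (SELECT B FROM my_tmp WHERE my_tmp.A = test_upsert.A)
        WHERE A IN (SELECT A FROM my_tmp)
    """)

# The staging table is no longer needed.
conn.execute('DROP TABLE my_tmp')

result = pd.read_sql_query('SELECT A, B FROM test_upsert ORDER BY A', conn)
print(result)
```

Compared with delete-then-insert, an in-place UPDATE leaves any columns that are not in the staging table untouched, which matters when the DataFrame holds only part of each row.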
