Executing an SQL update statement from a pandas dataframe

Question

Context: I am using MSSQL, pandas, and pyodbc.

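Steps:

  • Obtain a dataframe from a query using pyodbc (no problem)
  • Process the columns to generate the content of a new (but already existing) column
  • Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)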

Now, how do I execute the SQL code in the auxiliary column without looping through each row?

Sample data

The first two columns are obtained by querying dbo.table; the third column exists in the database but is empty. The fourth column exists only in the dataframe, to prepare the SQL statement that would update dbo.table.

ID  raw                   processed    strSQL
--  --------------------  -----------  ------
 1  lorum.ipsum@test.com  lorum ipsum  UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
 2  rumlo.sumip@test.com  rumlo sumip  UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
 3  ...                   ...          ...

I would like to execute the SQL statement in each row in an efficient manner.

Recommended answer

After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.

For an existing table named MillionRows

ID  TextField
--  ---------
 1  foo
 2  bar
 3  baz
…

and sample data of the form

num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]

my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True

crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)

took about 180 seconds (3 minutes).
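To connect this back to the dataframe in the question: instead of building one UPDATE string per row, the values can be passed as parameters to a single statement. The following is only a minimal sketch, assuming the dataframe has the ID and processed columns from the sample data. The helper names build_params and run_executemany are made up for illustration, and cnxn is assumed to be an open pyodbc connection.

```python
def build_params(records):
    """Turn dataframe rows into (processed, ID) tuples.

    `records` is an iterable of dicts, e.g. df.to_dict("records").
    The tuple order must match the order of the ? placeholders below.
    """
    return [(rec["processed"], rec["ID"]) for rec in records]


def run_executemany(cnxn, params):
    """Run one parameterized UPDATE for all rows (hypothetical helper)."""
    # `table` is a reserved word in T-SQL, so the placeholder name from
    # the question is bracketed here; substitute the real table name.
    sql = "UPDATE dbo.[table] SET processed = ? WHERE ID = ?"
    crsr = cnxn.cursor()
    crsr.fast_executemany = True  # pyodbc option used in the test above
    crsr.executemany(sql, params)
    cnxn.commit()


# Usage, assuming pyodbc is installed and `df` is the question's dataframe:
#   cnxn = pyodbc.connect(connection_string, autocommit=False)
#   run_executemany(cnxn, build_params(df.to_dict("records")))
```

Passing values as parameters also avoids quoting problems that can arise when text is pasted into SQL strings.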

However, by creating a user-defined table type

CREATE TYPE dbo.TextField_ID AS TABLE 
(
    TextField nvarchar(255) NULL, 
    ID int NOT NULL, 
    PRIMARY KEY (ID)
)

and a stored procedure

CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END

when I used

crsr.execute("{CALL mr_update (?)}", (rows,))

it did the same update in approximately 80 seconds (less than half the time).
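The TVP call can be wrapped in the same way. This is a sketch under the assumption that the dbo.TextField_ID type and the mr_update procedure shown above already exist on the server; the helper name run_tvp_update is hypothetical, and cnxn is again assumed to be an open pyodbc connection.

```python
def run_tvp_update(cnxn, rows):
    """Call mr_update once, passing all rows as one table-valued parameter.

    Hypothetical helper. Each tuple in `rows` must follow the column
    order of dbo.TextField_ID, i.e. (TextField, ID).
    """
    crsr = cnxn.cursor()
    # The whole list is a single parameter, hence the one-element tuple.
    crsr.execute("{CALL mr_update (?)}", (rows,))
    cnxn.commit()


# Usage, assuming pyodbc is installed and `rows` is built as shown earlier:
#   cnxn = pyodbc.connect(connection_string, autocommit=False)
#   run_tvp_update(cnxn, rows)
```

For the dataframe in the question, the rows would be (processed, ID) tuples, and the table type and procedure would need to be adapted to that table.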
