Executing an SQL update statement from a pandas dataframe
Question
Context: I am using MSSQL, pandas, and pyodbc.
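Steps:
- Obtain a dataframe from a query using pyodbc (no problem)
- Process columns to generate the content of a new (but already existing) column
- Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)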
Now how do I execute the SQL code in the auxiliary column without looping through each row?
Sample data
The first two columns are obtained by querying dbo.table. The third column exists in the database but is empty. The fourth column exists only in the dataframe and holds the SQL statement that would update dbo.table for that row.
ID | raw | processed | strSQL |
---|---|---|---|
1 | lorum.ipsum@test.com | lorum ipsum | UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1 |
2 | rumlo.sumip@test.com | rumlo sumip | UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2 |
3 | ... | ... | ... |
I would like to execute the SQL script in each row in an efficient manner.
Recommended answer
After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and sample data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
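To connect this back to the question, the executemany() approach can be sketched as follows, starting from (ID, text) pairs taken out of the dataframe instead of one prebuilt SQL string per row. This is a minimal sketch, not the code that was timed: the helper names to_params and run_executemany are hypothetical, and cnxn is assumed to be an open pyodbc connection.

```python
from typing import Iterable, List, Tuple

# Same parameterized statement as in the test above.
UPDATE_SQL = "UPDATE MillionRows SET TextField = ? WHERE ID = ?"


def to_params(pairs: Iterable[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Reorder (ID, text) pairs into (text, ID) tuples.

    The tuple order must match the order of the ? markers in UPDATE_SQL.
    int() is a precaution in case the IDs are not native Python integers.
    """
    return [(text, int(row_id)) for row_id, text in pairs]


def run_executemany(cnxn, params: List[Tuple[str, int]]) -> None:
    """Send every update through one executemany() call, then commit once.

    Hypothetical helper; `cnxn` is assumed to be a pyodbc connection.
    """
    cnxn.autocommit = False
    crsr = cnxn.cursor()
    crsr.fast_executemany = True
    crsr.executemany(UPDATE_SQL, params)
    cnxn.commit()
```

With a dataframe df like the one in the question, the pairs could come from df[["ID", "processed"]].itertuples(index=False, name=None), with the table and column names in UPDATE_SQL adjusted to match. Passing values as parameters also avoids building the strSQL column by string concatenation.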
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
    TextField nvarchar(255) NULL,
    ID int NOT NULL,
    PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
    @tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).
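The TVP call can be wrapped in a small helper. The sketch below also splits the rows into chunks so that each procedure call carries a bounded number of rows. That chunking is an assumption added here for illustration; the timing above passed all rows in a single call, and whether chunking helps was not measured. The names chunked and update_via_tvp and the default chunk_size are hypothetical, and cnxn is assumed to be an open pyodbc connection.

```python
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[str, int]  # (TextField, ID), the column order of dbo.TextField_ID


def chunked(rows: Sequence[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of `rows`, each at most `size` items long."""
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


def update_via_tvp(cnxn, rows: Sequence[Row], chunk_size: int = 100_000) -> int:
    """Call mr_update once per chunk, passing the chunk as the TVP.

    Hypothetical helper; returns the number of procedure calls made.
    Each chunk must not repeat an ID, because the table type declares
    PRIMARY KEY (ID).
    """
    crsr = cnxn.cursor()
    calls = 0
    for chunk in chunked(rows, chunk_size):
        # A single parameter whose value is a list of row tuples.
        crsr.execute("{CALL mr_update (?)}", (chunk,))
        calls += 1
    cnxn.commit()
    return calls
```

Each tuple has to follow the column order of the table type (TextField first, then ID), which is why the sample rows above are built as (text, ID).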