pyodbc/sqlAlchemy enable fast executemany


Question


In response to my question "How to speed up data wrangling A LOT in Python + Pandas + sqlAlchemy + MSSQL/T-SQL" I was kindly directed to "Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC" by @IljaEverilä.

NB For test purposes I am only reading/writing 10k rows.

I added the event listener and a) the function is called, but b) clearly executemany is not set, as the if executemany: check fails and cursor.fast_executemany is not set.

import sqlalchemy as sqla
from sqlalchemy import event

def namedDbSqlAEngineCreate(dbName):
    # Create an engine and switch to the named db
    # returns the engine if successful and None if not
    # 2018-08-23 added fast_executemany according to this https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc?rq=1
    engineStr = 'mssql+pyodbc://@' + defaultDSN
    engine = sqla.create_engine(engineStr, echo=False)

    @event.listens_for(engine, 'before_cursor_execute')
    def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
        # print("FUNC call")
        if executemany:
            print('executemany')
            cursor.fast_executemany = True
    try:
        engine.execute('USE ' + dbName)
        return engine
    except sqla.exc.SQLAlchemyError as ex:
        if ex.orig.args[0] == '08004':
            print('namedDbSqlAEngineCreate: Database %s does not exist' % dbName)
        else:
            print(ex.args[0])
        return None

Naturally there is no change in speed.

The to_sql call in my original question is unchanged

nasToFillDF.to_sql(name=tempTableName, con=engine.engine, if_exists='replace', chunksize=100, index=False)

because when I tried setting chunksize=None, per the example, I received the error message (which I had encountered previously):

(pyodbc.ProgrammingError) ('The SQL contains -31072 parameter markers, but 100000 parameters were supplied', 'HY000')

What have I done wrong? I guess the executemany parameter of receive_before_cursor_execute is not being set, but if that is the answer, I have no idea how to fix it.

Setup is pyodbc 4.0.23, SQLAlchemy 1.2.6, Python 3.6.something

Solution

The error you received is caused by changes introduced in Pandas version 0.23.0, reverted in 0.23.1, and reintroduced in 0.24.0, as explained here. The produced VALUES clause contains 100,000 parameter markers, and it would seem that the count is stored in a signed 16-bit integer, so it overflows and you get the funny

The SQL contains -31072 parameter markers, but 100000 parameters were supplied

You can check for yourself:

In [16]: 100000 % (2 ** 16) - 2 ** 16
Out[16]: -31072

If you would like to keep on using Pandas as is, you will have to calculate and provide a suitable chunksize value, such as the 100 you were using, taking into account both the maximum row limit of 1,000 for the VALUES clause and the maximum parameter limit of 2,100 for stored procedures. The details are again explained in the linked Q/A.
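
For illustration, here is a minimal sketch (not part of the original answer) of how such a chunksize could be derived from the frame's column count, assuming one parameter per column plus one for the index when it is written; the helper name is hypothetical:

import pandas as pd

def safe_chunksize(df: pd.DataFrame, index: bool = True) -> int:
    # Hypothetical helper: stay within SQL Server's limits of at most 1,000 rows
    # per INSERT ... VALUES and at most 2,100 parameters per statement.
    params_per_row = len(df.columns) + (1 if index else 0)
    # Keep strictly below the 2,100-parameter ceiling; for a 9-column frame
    # written with its index this gives 209, the value used in the timings below.
    return min(1000, (2100 - 1) // params_per_row)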

Before the change, Pandas always used executemany() when inserting data. Newer versions detect whether the dialect in use supports a VALUES clause in INSERT. This detection happens in SQLTable.insert_statement() and cannot be controlled, which is a shame, since PyODBC fixed their executemany() performance provided the right flag is enabled.

In order to force Pandas to use executemany() with PyODBC again, SQLTable has to be monkeypatched:

import pandas.io.sql

# Return the bare INSERT statement plus the data, so SQLAlchemy runs it with
# executemany() instead of expanding a multi-row VALUES clause.
def insert_statement(self, data, conn):
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement

This will be horribly slow if the Cursor.fast_executemany flag is not set, so remember to set the proper event handler.
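
For reference, a minimal listener sketch, essentially the same idea as the one in the question above; the DSN string here is a placeholder, not taken from the original post:

import sqlalchemy as sqla
from sqlalchemy import event

engine = sqla.create_engine('mssql+pyodbc://@YourDSN')  # placeholder DSN

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        # Turn on pyODBC's fast executemany mode for this cursor.
        cursor.fast_executemany = True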

Here is a simple performance comparison, using the following dataframe:

In [12]: df = pd.DataFrame({f'X{i}': range(1000000) for i in range(9)})

Vanilla Pandas 0.24.0:

In [14]: %time df.to_sql('foo', engine, chunksize=209)
CPU times: user 2min 9s, sys: 2.16 s, total: 2min 11s
Wall time: 2min 26s

Monkeypatched Pandas with fast executemany enabled:

In [10]: %time df.to_sql('foo', engine, chunksize=500000)
CPU times: user 12.2 s, sys: 981 ms, total: 13.2 s
Wall time: 38 s
