Improve pandas' to_sql() performance with SQL Server


Question

I come to you because I cannot fix an issue with the pandas.DataFrame.to_sql() method.

I've made the connection between my script and my database, and I can send queries, but it is too slow for me.

I would like to find a way to improve the performance of my script on this. Maybe someone will find a solution?

Here is my code:

  engine = sqlalchemy.create_engine(con['sql']['connexion_string'])
  conn = engine.connect()
  metadata = sqlalchemy.MetaData()  # note: MetaData, not Metadata
  try:
      if con['sql']['strategy'] == 'NEW':
          query = sqlalchemy.Table(con['sql']['table'], metadata).delete()
          conn.execute(query)
          Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                               index=False, chunksize=1000, method='multi')
      elif con['sql']['strategy'] == 'APPEND':
          Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                               index=False, chunksize=1000, method='multi')
      else:
          pass
  except Exception as e:
      print(type(e))

It works but is too slow when I remove the chunksize and method parameters (almost 3 minutes for 30 thousand rows). When I add these parameters, I get a sqlalchemy.exc.ProgrammingError...
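A likely cause of that ProgrammingError (an assumption, since the full traceback isn't shown): SQL Server limits a single statement to 2100 bound parameters, and with method='multi' pandas sends chunksize × number-of-columns parameters in one INSERT, so chunksize=1000 blows past the limit on any table wider than two columns. A minimal sketch of picking a chunksize that stays under the limit (the helper name is hypothetical):

```python
def safe_chunksize(n_columns, param_limit=2100):
    """Largest chunksize keeping one multi-row INSERT under SQL Server's
    2100-bound-parameter limit (each row contributes n_columns params)."""
    return max(1, (param_limit - 1) // n_columns)

# A 10-column DataFrame can send at most 209 rows per chunk:
print(safe_chunksize(10))
```

Passing that value as chunksize would avoid the error, but the answer below is a much bigger win.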

Thanks for your help!

Answer

For mssql+pyodbc, you will get the best performance from to_sql() if you

  1. use Microsoft's ODBC Driver for SQL Server, and
  2. enable fast_executemany=True in your create_engine call.

For example, this code runs in just over 3 seconds on my network:

from time import time
import pandas as pd
import sqlalchemy as sa

ngn_local = sa.create_engine("mssql+pyodbc://mssqlLocal64")
ngn_remote = sa.create_engine(
    (
        "mssql+pyodbc://sa:_whatever_@192.168.0.199/mydb"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    ),
    fast_executemany=True,
)

df = pd.read_sql_query(
    "SELECT * FROM MillionRows WHERE ID <= 30000", ngn_local
)

t0 = time()
df.to_sql("pd_test", ngn_remote, index=False, if_exists="replace")
print(f"{time() - t0} seconds")

whereas with fast_executemany=False (which is the default) the same process takes 143 seconds (2.4 minutes).
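The speedup comes from pyodbc itself: with fast_executemany, an executemany() call binds all rows into one parameter buffer and sends them in a single round trip instead of one per row. The same flag can be set directly on a raw pyodbc cursor; a minimal sketch, assuming pyodbc is installed and using a placeholder connection string and table:

```python
def bulk_insert(conn_str, rows):
    """Insert (id, name) rows in one bulk round trip.

    Assumptions: pyodbc is installed, conn_str points at a SQL Server
    database reachable via Microsoft's ODBC driver, and a table
    pd_test(id, name) exists -- all placeholders for illustration.
    """
    import pyodbc  # imported here so the sketch loads without a driver

    cnxn = pyodbc.connect(conn_str)
    crsr = cnxn.cursor()
    crsr.fast_executemany = True  # bind all rows at once, not per-row
    crsr.executemany("INSERT INTO pd_test (id, name) VALUES (?, ?)", rows)
    cnxn.commit()
    cnxn.close()
```

This is what create_engine(..., fast_executemany=True) enables under the hood for every cursor SQLAlchemy opens.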
