Slow loading SQL Server table into pandas DataFrame
Problem description
pandas gets extremely slow when loading more than 10 million records from a SQL Server database using pyodbc, mainly through the function pandas.read_sql(query, pyodbc_conn). The following code takes 40-45 minutes to load 10-15 million records from the SQL table Table1.
Is there a better and faster way to read a SQL table into a pandas DataFrame?
import pyodbc
import pandas

# Placeholders: replace with your own connection details.
server = '<server_ip>'
database = '<db_name>'
username = '<db_user>'
password = '<password>'
port = '1443'  # value as given in the question; SQL Server's default port is 1433
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';PORT=' + port + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = conn.cursor()
data = pandas.read_sql("select * from Table1", conn)  # Takes about 40-45 minutes to complete
Recommended answer
I had the same problem with even more rows (~50 million). I ended up running the SQL query in chunks and storing the results in an .h5 (HDF5) file.
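import pandas as pd

# `con` is an already-open database connection (e.g. the pyodbc connection from the question).
# With chunksize set, read_sql returns an iterator of DataFrames of up to 100,000 rows each.
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)
hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = ['<col_1>', '<col_2>']  # placeholder: list of columns that we want to index in the HDF5 file
for chunk in sql_reader:
    # index=False defers index creation until all chunks have been written
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)
# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()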
This way, we'll be able to read the data back faster than with pandas.read_csv.
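For completeness, here is a minimal sketch of the write-then-read round trip. It assumes PyTables is installed (pandas needs it for HDF5), and it uses a few small in-memory DataFrames as a hypothetical stand-in for the SQL chunks. The helper names append_chunks, read_subset and demo are made up for illustration and are not part of pandas.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def append_chunks(chunks, hdf_fn, hdf_key, cols_to_index):
    """Append an iterable of DataFrames to one HDF5 table, then index it."""
    with pd.HDFStore(hdf_fn) as store:
        for chunk in chunks:
            # Defer index creation until every chunk has been written.
            store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)
        store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')


def read_subset(hdf_fn, hdf_key, where=None):
    """Read the whole table, or only the rows matching `where` on a data column."""
    return pd.read_hdf(hdf_fn, key=hdf_key, where=where)


def demo():
    # Hypothetical stand-in for the SQL chunks: 3 chunks of 4 rows each.
    chunks = [
        pd.DataFrame({'id': np.arange(i * 4, (i + 1) * 4), 'value': np.arange(4) * 1.5})
        for i in range(3)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        hdf_fn = os.path.join(tmp, 'result.h5')
        append_chunks(chunks, hdf_fn, 'my_huge_df', ['id'])
        full = read_subset(hdf_fn, 'my_huge_df')
        # Only columns listed in data_columns can be used in a `where` filter.
        subset = read_subset(hdf_fn, 'my_huge_df', where='id >= 8')
    return full, subset
```

Calling demo() returns the full table and the filtered subset. With a real table, the where filter lets you pull only the rows you need from the file instead of loading everything into memory.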