How to speed up loading data from oracle sql to pandas df
Problem description
My code looks like this. I use pd.DataFrame.from_records to fill data into the dataframe, but it takes Wall time: 1h 40min 30s to process the request and load the 22 million rows from the SQL table into the df.
# I skipped some of the code, since there is no problem with the query itself; it's fast
cur = con.cursor()

def db_select(query):  # takes the query text and returns a dataframe
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # parse headers
    df = pd.DataFrame.from_records(cur, columns=col)  # fill the data into the dataframe
    return df
Then I pass the SQL query to the function:
frame = db_select("select * from table")
How can I optimize the code to speed up the process?
Recommended answer
Setting a proper value for cur.arraysize might help tune fetch performance. You need to determine the most suitable value for it; the default is 100. You can run the code with different array sizes to find that value, for example:
from datetime import datetime

arr = [100, 1000, 10000, 100000, 1000000]
for size in arr:
    try:
        cur.prefetchrows = 0
        cur.arraysize = size
        start = datetime.now()
        cur.execute("SELECT * FROM mytable").fetchall()
        elapsed = datetime.now() - start
        print("Process duration for arraysize", size, "is", elapsed)
    except Exception as err:
        print("Memory Error", err, "for arraysize", size)
and then, before calling db_select from your original code, set e.g. cur.arraysize = 10000.