Using pd.read_sql() to extract large data (>5 million records) from an Oracle database makes the SQL execution very slow

Views: 182  Published: 2021/4/27 20:48:27  Tags: python-2.7 cx-oracle pandasql

Problem description

Initially I tried using pd.read_sql().
Then I tried sqlalchemy and query objects, but none of these methods helped: the SQL keeps executing for a very long time and never finishes.
I also tried using optimizer hints.
I suspect the problem is the following: pandas creates a cursor object in the background. With cx_Oracle we cannot influence the "arraysize" parameter that this cursor uses, so the default value of 100 is always used, which is far too small.
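
One way to check the arraysize hypothesis is to bypass pandas and fetch through a cursor directly, setting arraysize before the statement is executed. The following is only a minimal sketch under assumptions: `iter_batches` is a hypothetical helper name, and sqlite3 is used as a stand-in driver so the example runs without an Oracle instance. With cx_Oracle the cursor would come from a cx_Oracle connection instead, and the batches could then be turned into a DataFrame.

```python
import sqlite3  # stand-in driver (assumption) so the sketch runs without Oracle


def iter_batches(cursor, sql, params=None, batch_size=10000):
    """Hypothetical helper: run sql on a DB-API cursor and yield lists of rows."""
    # Set arraysize before execute(); DB-API drivers such as cx_Oracle use it
    # as the number of rows fetched per internal round trip.
    cursor.arraysize = batch_size
    if params is None:
        cursor.execute(sql)
    else:
        cursor.execute(sql, params)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Stand-in demo data: 25 rows in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t (id) VALUES (:id)",
                 [{"id": i} for i in range(25)])

batch_sizes = [
    len(batch)
    for batch in iter_batches(conn.cursor(),
                              "SELECT id FROM t WHERE id >= :lo",
                              {"lo": 0},
                              batch_size=10)
]
print(batch_sizes)  # prints [10, 10, 5]
```

If the first batch still takes a very long time to arrive with a large arraysize, the time is probably being spent executing the query on the server rather than transferring rows, and the fetch size is not the bottleneck.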

CODE:

  import pandas as pd
  import Configuration.Settings as CS
  import DataAccess.Databases as SDB
  import sqlalchemy
  import cx_Oracle

  dfs = []

  DBM = SDB.Database(CS.DB_PRM,PrintDebugMessages=False,ClientInfo="Loader")

  sql = '''
                              WITH            
                                  l AS
                                (
                                SELECT DISTINCT /*+ materialize */
                                  hcz.hcz_lwzv_id AS lwzv_id

                                FROM
                                  pm_mbt_materialbasictypes mbt
                                  INNER JOIN pm_mpt_materialproducttypes mpt ON mpt.mpt_mbt_id = mbt.mbt_id
                                  INNER JOIN pm_msl_materialsublots msl ON msl.msl_mpt_id = mpt.mpt_id
                                  INNER JOIN pm_historycompattributes hca ON hca.hca_msl_id = msl.msl_id AND hca.hca_ignoreflag = 0 
                                  INNER JOIN pm_tpm_testdefprogrammodes tpm ON tpm.tpm_id = hca.hca_tpm_id 
                                  inner join pm_tin_testdefinsertions tin on tin.tin_id = tpm.tpm_tin_id             
                                  INNER JOIN pm_hcz_history_comp_zones hcz ON hcz.hcz_hcp_id = hca.hca_hcp_id
                                WHERE
                                  mbt.mbt_name = :input1 and tin.tin_name = 'x1' and
                                 hca.hca_testendday < '2018-5-31' and hca.hca_testendday > '2018-05-30'                  
                                ),
                            TPL as 
                                (
                                select /*+ materialize */
                                  *
                                  from
                                  (
                                  select
                                    ut.ut_id,
                                    ut.ut_basic_type,
                                    ut.ut_insertion,
                                    ut.ut_testprogram_name,
                                    ut.ut_revision  
                                  from
                                    pm_updated_testprogram ut
                                  where 
                                    ut.ut_basic_type = :input1 and ut.ut_insertion = :input2 
                                  order by
                                    ut.ut_revision desc  
                                  ) where rownum = 1
                                )

                                SELECT /*+ FIRST_ROWS */
                                  rcl.rcl_lotidentifier                                           AS LOT, 
                                  lwzv.lwzv_wafer_id                                              AS WAFER,
                                  pzd.pzd_zone_name                                               AS ZONE,
                                  tte.tte_tpm_id||'~'||tte.tte_testnumber||'~'||tte.tte_testname  AS Test_Identifier,
                                  case when ppd.ppd_measurement_result > 1e15 then NULL else SFROUND(ppd.ppd_measurement_result,6) END AS Test_Results

                               FROM
                                  TPL 
                                  left JOIN pm_pcm_details pcm on pcm.pcm_ut_id = TPL.ut_id 
                                  left JOIN pm_tin_testdefinsertions tin ON tin.tin_name = TPL.ut_insertion 
                                  left JOIN pm_tpr_testdefprograms tpr ON tpr.tpr_name = TPL.ut_testprogram_name and tpr.tpr_revision = TPL.ut_revision
                                  left JOIN pm_tpm_testdefprogrammodes tpm ON tpm.tpm_tpr_id = tpr.tpr_id and tpm.tpm_tin_id = tin.tin_id 
                                  left JOIN pm_tte_testdeftests tte on tte.tte_tpm_id = tpm.tpm_id and tte.tte_testnumber = pcm.pcm_testnumber 
                                  cross join l 
                                  left JOIN pm_lwzv_info lwzv ON lwzv.lwzv_id = l.lwzv_id 
                                  left JOIN pm_rcl_resultschipidlots rcl ON rcl.rcl_id = lwzv.lwzv_rcl_id                                   
                                  left JOIN pm_pcm_zone_def pzd ON pzd.pzd_basic_type = TPL.ut_basic_type and pzd.pzd_pcm_x = lwzv.lwzv_pcm_x and pzd.pzd_pcm_y = lwzv.lwzv_pcm_y 
                                  left JOIN pm_pcm_par_data ppd ON ppd.ppd_lwzv_id = l.lwzv_id and ppd.ppd_tte_id = tte.tte_id

    '''
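  # Method 1: using query objects from the DataAccess.Databases wrapper imported above.
  Q = DBM.getQueryObject(sql)
  Q.execute({"input1": 'xxxx', "input2": 'yyyy'})

  while not Q.AtEndOfResultset:
      print(Q)

  # Method 2: using sqlalchemy, passing arraysize to the cx_Oracle dialect.
  # The connect string is built from adjacent literals because a quoted
  # string cannot span several lines.
  connectstring = (
      "oracle+cx_oracle://username:Password@"
      "(description=(address_list=(address=(protocol=tcp)"
      "(host=tnsconnect string)(port=portnumber)))"
      "(connect_data=(sid=xxxx)))"
  )
  engine = sqlalchemy.create_engine(connectstring, arraysize=10000)
  df_p = pd.read_sql(sql, params={"input1": 'xxxx', "input2": 'yyyy'}, con=engine)

  # Method 3: using pd.read_sql_query() on the existing connection.
  # (The original snippet passed an undefined name, SQL_PCM; the query defined above is used here.)
  df_p = pd.read_sql_query(sql, params={"input1": 'xxxx', "input2": 'yyyy'},
                           coerce_float=True, con=DBM.Connection)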
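
All three methods above try to pull the whole result in one go. As a point of comparison, pandas can also return the result incrementally through the `chunksize` argument of `pd.read_sql_query()`. The sketch below is an illustration under assumptions, not a tested fix for the Oracle case: it uses an in-memory sqlite3 database and a made-up `results` table as a stand-in, and whether chunking actually lowers memory use or wait time with cx_Oracle depends on the driver and connection settings.

```python
import sqlite3  # stand-in for the Oracle connection (assumption)

import pandas as pd

# Made-up stand-in table with 1000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (lot TEXT, value REAL)")
conn.executemany(
    "INSERT INTO results (lot, value) VALUES (?, ?)",
    [("lot%d" % (i % 3), float(i)) for i in range(1000)],
)

total_rows = 0
chunks = []
# With chunksize set, read_sql_query returns an iterator of DataFrames
# instead of a single DataFrame.
for chunk in pd.read_sql_query(
    "SELECT lot, value FROM results WHERE value >= :lo",
    conn,
    params={"lo": 0.0},
    chunksize=250,
):
    total_rows += len(chunk)  # each chunk could be processed or written out here
    chunks.append(chunk)

# Only concatenate if the full result really has to be held in memory.
df = pd.concat(chunks, ignore_index=True)
print(total_rows, len(chunks))
```

Processing each chunk as it arrives also shows quickly whether any rows come back at all, which helps separate a slow query from a slow transfer.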

It would be great if someone could help me out with this. Thanks in advance.
