Using pd.read_sql() to extract large data (>5 million records) from an Oracle database makes the SQL execution very slow


Problem description

  1. Initially I tried using pd.read_sql().
  2. Then I tried sqlalchemy and query objects, but none of these methods helped: the SQL runs for a very long time and never finishes.
  3. I tried using hints.
  4. I guess the problem is the following: pandas creates a cursor object in the background. With cx_Oracle we cannot influence the "arraysize" parameter that this cursor uses, so the default value of 100 is always used, which is far too small.
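
To make the concern in point 4 concrete, here is a rough back-of-the-envelope sketch of how the array size affects the number of fetch round trips for 5 million rows. It assumes each fetch round trip returns at most `arraysize` rows and ignores prefetching, network latency, and the time the query itself takes on the server; the function name `fetch_round_trips` is made up for this illustration.

```python
def fetch_round_trips(total_rows, arraysize):
    """Approximate number of fetch round trips needed to read total_rows rows."""
    # Integer ceiling division: one extra trip for a final, partially filled batch.
    return -(-total_rows // arraysize)


TOTAL_ROWS = 5_000_000

for size in (100, 5000, 10000):
    print(size, fetch_round_trips(TOTAL_ROWS, size))
```

Under these assumptions, an array size of 100 needs 50,000 round trips, while 5,000 needs 1,000 and 10,000 needs 500. If the statement never returns its first row, though, the bottleneck is more likely the query plan than the fetch size.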

CODE:

  import pandas as pd
  import Configuration.Settings as CS
  import DataAccess.Databases as SDB
  import sqlalchemy
  import cx_Oracle

  dfs = []

  DBM = SDB.Database(CS.DB_PRM,PrintDebugMessages=False,ClientInfo="Loader")

  sql = '''
                              WITH            
                                  l AS
                                (
                                SELECT DISTINCT /*+ materialize */
                                  hcz.hcz_lwzv_id AS lwzv_id

                                FROM
                                  pm_mbt_materialbasictypes mbt
                                  INNER JOIN pm_mpt_materialproducttypes mpt ON mpt.mpt_mbt_id = mbt.mbt_id
                                  INNER JOIN pm_msl_materialsublots msl ON msl.msl_mpt_id = mpt.mpt_id
                                  INNER JOIN pm_historycompattributes hca ON hca.hca_msl_id = msl.msl_id AND hca.hca_ignoreflag = 0 
                                  INNER JOIN pm_tpm_testdefprogrammodes tpm ON tpm.tpm_id = hca.hca_tpm_id 
                                  inner join pm_tin_testdefinsertions tin on tin.tin_id = tpm.tpm_tin_id             
                                  INNER JOIN pm_hcz_history_comp_zones hcz ON hcz.hcz_hcp_id = hca.hca_hcp_id
                                WHERE
                                  mbt.mbt_name = :input1 and tin.tin_name = 'x1' and
                                 hca.hca_testendday < '2018-5-31' and hca.hca_testendday > '2018-05-30'                  
                                ),
                            TPL as 
                                (
                                select /*+ materialize */
                                  *
                                  from
                                  (
                                  select
                                    ut.ut_id,
                                    ut.ut_basic_type,
                                    ut.ut_insertion,
                                    ut.ut_testprogram_name,
                                    ut.ut_revision  
                                  from
                                    pm_updated_testprogram ut
                                  where 
                                    ut.ut_basic_type = :input1 and ut.ut_insertion = :input2 
                                  order by
                                    ut.ut_revision desc  
                                  ) where rownum = 1
                                )

                                SELECT /*+ FIRST_ROWS */
                                  rcl.rcl_lotidentifier                                           AS LOT, 
                                  lwzv.lwzv_wafer_id                                              AS WAFER,
                                  pzd.pzd_zone_name                                               AS ZONE,
                                  tte.tte_tpm_id||'~'||tte.tte_testnumber||'~'||tte.tte_testname  AS Test_Identifier,
                                  case when ppd.ppd_measurement_result > 1e15 then NULL else SFROUND(ppd.ppd_measurement_result,6) END AS Test_Results

                               FROM
                                  TPL 
                                  left JOIN pm_pcm_details pcm on pcm.pcm_ut_id = TPL.ut_id 
                                  left JOIN pm_tin_testdefinsertions tin ON tin.tin_name = TPL.ut_insertion 
                                  left JOIN pm_tpr_testdefprograms tpr ON tpr.tpr_name = TPL.ut_testprogram_name and tpr.tpr_revision = TPL.ut_revision
                                  left JOIN pm_tpm_testdefprogrammodes tpm ON tpm.tpm_tpr_id = tpr.tpr_id and tpm.tpm_tin_id = tin.tin_id 
                                  left JOIN pm_tte_testdeftests tte on tte.tte_tpm_id = tpm.tpm_id and tte.tte_testnumber = pcm.pcm_testnumber 
                                  cross join l 
                                  left JOIN pm_lwzv_info lwzv ON lwzv.lwzv_id = l.lwzv_id 
                                  left JOIN pm_rcl_resultschipidlots rcl ON rcl.rcl_id = lwzv.lwzv_rcl_id                                   
                                  left JOIN pm_pcm_zone_def pzd ON pzd.pzd_basic_type = TPL.ut_basic_type and pzd.pzd_pcm_x = lwzv.lwzv_pcm_x and pzd.pzd_pcm_y = lwzv.lwzv_pcm_y 
                                  left JOIN pm_pcm_par_data ppd ON ppd.ppd_lwzv_id = l.lwzv_id and ppd.ppd_tte_id = tte.tte_id

    '''
# Method 1: using query objects.

Q = DBM.getQueryObject(sql)
Q.execute({"input1": 'xxxx', "input2": 'yyyy'})

while not Q.AtEndOfResultset:
    print(Q)

# Method 2: using sqlalchemy

connectstring = (
    "oracle+cx_oracle://username:Password@(description="
    "(address_list=(address=(protocol=tcp)(host=tnsconnect string)"
    "(port=portnumber)))(connect_data=(sid=xxxx)))"
)
engine = sqlalchemy.create_engine(connectstring, arraysize=10000)
df_p = pd.read_sql(sql, params={"input1": 'xxxx', "input2": 'yyyy'}, con=engine)

# Method 3: using pd.read_sql_query()

df_p = pd.read_sql_query(sql, params={"input1": 'xxxx', "input2": 'yyyy'},
                         coerce_float=True, con=DBM.Connection)

It would be great if someone could help me out with this. Thanks in advance.

Recommended answer

Here is another way to adjust the array size, without needing to create the oraaccess.xml file that Chris suggested. This may not work with the rest of your code as is, but it should give you an idea of how to proceed if you want to try this approach.

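import cx_Oracle
import pandas as pd
import sqlalchemy

class Connection(cx_Oracle.Connection):

    def __init__(self):
        super(Connection, self).__init__("user/pw@dsn")

    def cursor(self):
        # Every cursor handed out by this connection gets a larger fetch array.
        c = super(Connection, self).cursor()
        c.arraysize = 5000
        return c

# create_engine() needs a URL to pick the dialect; creator supplies the connections.
engine = sqlalchemy.create_engine("oracle+cx_oracle://", creator=Connection)
df = pd.read_sql(sql, engine)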
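
If subclassing the connection does not fit your setup, a more hands-on variant is to skip pandas' cursor handling altogether and drive a DB-API cursor yourself. The following is only a sketch under stated assumptions: `fetch_in_batches` is a hypothetical helper, and the demo uses the standard-library `sqlite3` module as a stand-in so that it runs without an Oracle instance. With cx_Oracle you would pass a `cx_Oracle` connection instead; both drivers expose `cursor.arraysize`, `fetchmany()`, and `:name` bind parameters.

```python
import sqlite3


def fetch_in_batches(connection, sql, params=None, arraysize=5000):
    """Yield lists of rows, reading up to `arraysize` rows per fetchmany() call."""
    cursor = connection.cursor()
    try:
        cursor.arraysize = arraysize
        cursor.execute(sql, params or {})
        while True:
            rows = cursor.fetchmany()  # batch size defaults to cursor.arraysize
            if not rows:
                break
            yield rows
    finally:
        cursor.close()


# Stand-in demo: an in-memory SQLite table instead of the Oracle schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE t (n INTEGER)")
demo.executemany("INSERT INTO t (n) VALUES (?)", [(i,) for i in range(12)])

batch_sizes = [
    len(batch)
    for batch in fetch_in_batches(
        demo, "SELECT n FROM t WHERE n >= :low", {"low": 2}, arraysize=4
    )
]
print(batch_sizes)
```

Each batch can be turned into a DataFrame and the pieces concatenated at the end, or processed one at a time so that all 5 million rows never have to sit in memory at once.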
