INSERT of 10 million queries under 10 minutes in Oracle?


Question

I am working on a file loader program.

The purpose of this program is to take an input file, do some conversions on its data, and then upload the data into an Oracle database.

The problem that I am facing is that I need to optimize the insertion of very large input data into Oracle.

I am uploading data into a table, let's say ABC.

I am using the OCI library provided by Oracle in my C++ program. Specifically, I am using the OCI Connection Pool for multi-threading and loading into Oracle. (http://docs.oracle.com/cd/B28359_01/appdev.111/b28395/oci09adv.htm )

The following are the DDL statements that have been used to create the table ABC –

CREATE TABLE ABC(
   seq_no         NUMBER NOT NULL,
   ssm_id         VARCHAR2(9)  NOT NULL,
   invocation_id  VARCHAR2(100)  NOT NULL,
   calc_id        VARCHAR2(100),
   analytic_id    VARCHAR2(100) NOT NULL,
   analytic_value NUMBER NOT NULL,
   override       VARCHAR2(1)  DEFAULT  'N'   NOT NULL,
   update_source  VARCHAR2(255) NOT NULL,
   last_chg_user  CHAR(10)  DEFAULT  USER NOT NULL,
   last_chg_date  TIMESTAMP(3) DEFAULT  SYSTIMESTAMP NOT NULL
);

CREATE UNIQUE INDEX ABC_indx ON ABC(seq_no, ssm_id, invocation_id, analytic_id);

CREATE SEQUENCE ABC_seq;

CREATE OR REPLACE TRIGGER ABC_insert
BEFORE INSERT ON ABC
FOR EACH ROW
BEGIN
SELECT ABC_seq.nextval INTO :new.seq_no FROM DUAL;
END;
/

I am currently using the following query pattern to upload the data into the database. I am sending data in batches of 500 queries via the various threads of the OCI connection pool.

Sample of SQL insert query used -

insert into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source)
select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual

EXECUTION PLAN by Oracle for the above query -

-----------------------------------------------------------------------------
| Id  | Operation                | Name|Rows| Cost (%CPU) | Time     |
-----------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |     | 4  |     8   (0) | 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | ABC |    |             |          |
|   2 |   UNION-ALL              |     |    |             |          |
|   3 |    FAST DUAL             |     | 1  |     2   (0) | 00:00:01 |
|   4 |    FAST DUAL             |     | 1  |     2   (0) | 00:00:01 |
|   5 |    FAST DUAL             |     | 1  |     2   (0) | 00:00:01 |
|   6 |    FAST DUAL             |     | 1  |     2   (0) | 00:00:01 |
-----------------------------------------------------------------------------

The run times of the program loading 1 million rows -

Batch Size = 500
Number of threads - Execution Time -
10                  4:19
20                  1:58
30                  1:17
40                  1:34
45                  2:06
50                  1:21
60                  1:24
70                  1:41
80                  1:43
90                  2:17
100                 2:06


Average Run Time = 1:57    (Roughly 2 minutes)

I need to optimize and reduce this time further. The problem I am facing arises when I load 10 million rows.

The average run time for 10 million rows came out to be 21 minutes.

(My goal is to reduce this time to below 10 minutes.)

So I tried the following steps as well -

[1] Partitioned the table ABC on the basis of seq_no, using 30 partitions. Tested with 1 million rows - the performance was very poor, taking almost 4 times longer than the unpartitioned table.

[2] Another partitioning of the table ABC, this time on the basis of last_chg_date, again using 30 partitions.

2.a) Tested with 1 million rows - the performance was almost equal to that of the unpartitioned table. The difference was very small, so it was not considered further.

2.b) Tested the same again with 10 million rows - the performance was almost equal to that of the unpartitioned table. No noticeable difference.

The following DDL commands were used to achieve the partitioning -

CREATE TABLESPACE ts1 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts2 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts3 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts4 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts5 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts6 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts7 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts8 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts9 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts10 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts11 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts12 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts13 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts14 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts15 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts16 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts17 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts18 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts19 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts20 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts21 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts22 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts23 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts24 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts25 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts26 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts27 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts28 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts29 DATAFILE AUTOEXTEND ON;
CREATE TABLESPACE ts30 DATAFILE AUTOEXTEND ON;

CREATE TABLE ABC(
   seq_no           NUMBER NOT NULL,
   ssm_id           VARCHAR2(9)  NOT NULL,
   invocation_id    VARCHAR2(100)  NOT NULL,
   calc_id          VARCHAR2(100) NULL,
   analytic_id      VARCHAR2(100) NOT NULL,
   ANALYTIC_VALUE   NUMBER NOT NULL,
   override         VARCHAR2(1)  DEFAULT  'N'   NOT NULL,
   update_source    VARCHAR2(255) NOT NULL,
   last_chg_user    CHAR(10)  DEFAULT  USER NOT NULL,
   last_chg_date    TIMESTAMP(3) DEFAULT  SYSTIMESTAMP NOT NULL
)
PARTITION BY HASH(last_chg_date)
PARTITIONS 30
STORE IN (ts1, ts2, ts3, ts4, ts5, ts6, ts7, ts8, ts9, ts10, ts11, ts12, ts13,
ts14, ts15, ts16, ts17, ts18, ts19, ts20, ts21, ts22, ts23, ts24, ts25, ts26,
ts27, ts28, ts29, ts30);

CODE that I am using in the thread function (written in C++), using OCI -

void OracleLoader::bulkInsertThread(std::vector<std::string> const & statements)
{

    try
    {
        INFO("ORACLE_LOADER_THREAD","Entered Thread = %1%", m_env);
        string useOraUsr = "some_user";
        string useOraPwd = "some_password";

        int user_name_len   = useOraUsr.length();
        int passwd_name_len = useOraPwd.length();

        text* username((text*)useOraUsr.c_str());
        text* password((text*)useOraPwd.c_str());


        if(! m_env)
        {
            CreateOraEnvAndConnect();
        }
        OCISvcCtx *m_svc = (OCISvcCtx *) 0;
        OCIStmt *m_stm = (OCIStmt *)0;

        checkerr(m_err,OCILogon2(m_env,
                                 m_err,
                                 &m_svc,
                                 (CONST OraText *)username,
                                 user_name_len,
                                 (CONST OraText *)password,
                                 passwd_name_len,
                                 (CONST OraText *)poolName,
                                 poolNameLen,
                                 OCI_CPOOL));

        OCIHandleAlloc(m_env, (dvoid **)&m_stm, OCI_HTYPE_STMT, (size_t)0, (dvoid **)0);

////////// Execution Queries in the format of - /////////////////
//        insert into pm_own.sec_analytics (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value, override, update_source)
//        select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
//        union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
//        union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
//        union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual
//////////////////////////////////////////////////////////////////

        size_t startOffset = 0;
        const int batch_size = PCSecAnalyticsContext::instance().getBatchCount();
        while (startOffset < statements.size())
        {
            int remaining = (startOffset + batch_size < statements.size() ) ? batch_size : (statements.size() - startOffset );
            // Break the query vector to meet the batch size
            std::vector<std::string> items(statements.begin() + startOffset,
                                           statements.begin() + startOffset + remaining);

            //! Preparing the Query
            std::string insert_query = "insert into ";
            insert_query += Context::instance().getUpdateTable();
            insert_query += " (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value, override, update_source)\n";

            std::vector<std::string>::const_iterator i3 = items.begin();
            insert_query += *i3 ;

            for( i3 = items.begin() + 1; i3 != items.end(); ++i3)
                insert_query += "\nunion all " + *i3; // each item is a plain "select ... from dual"
            // Preparing the Statement and Then Executing it in the next step
            text *txtQuery((text *)(insert_query).c_str());
            checkerr(m_err, OCIStmtPrepare (m_stm, m_err, txtQuery, strlen((char *)txtQuery), OCI_NTV_SYNTAX, OCI_DEFAULT));
            checkerr(m_err, OCIStmtExecute (m_svc, m_stm, m_err, (ub4)1, (ub4)0, (OCISnapshot *)0, (OCISnapshot *)0, OCI_DEFAULT ));

            startOffset += batch_size;
        }

        // Here is the commit statement. I am committing at the end of each thread.
        checkerr(m_err, OCITransCommit(m_svc,m_err,(ub4)0));

        checkerr(m_err, OCIHandleFree((dvoid *) m_stm, OCI_HTYPE_STMT));
        checkerr(m_err, OCILogoff(m_svc, m_err));

        INFO("ORACLE_LOADER_THREAD","Thread Complete. Leaving Thread.");
    }

    catch(AnException &ex)
    {
        ERROR("ORACLE_LOADER_THREAD", "Oracle query failed with : %1%", std::string(ex.what()));
        throw AnException(string("Oracle query failed with : ") + ex.what());
    }
}

While this question was being answered, several methods were suggested to me for optimizing my INSERT query. I chose and used QUERY I in my program for the following reasons, which I discovered while testing the various INSERT queries that were suggested to me. QUERY I -

insert into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source)
select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual

EXECUTION PLAN by Oracle for QUERY I -

--------------------------------------------------------------------------
| Id  | Operation                | Name| Rows | Cost (%CPU)   | Time     |
--------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |     |  4   | 8   (0)       | 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | ABC |      |               |          |
|   2 |   UNION-ALL              |     |      |               |          |
|   3 |    FAST DUAL             |     |  1   | 2   (0)       | 00:00:01 |
|   4 |    FAST DUAL             |     |  1   | 2   (0)       | 00:00:01 |
|   5 |    FAST DUAL             |     |  1   | 2   (0)       | 00:00:01 |
|   6 |    FAST DUAL             |     |  1   | 2   (0)       | 00:00:01 |
--------------------------------------------------------------------------

QUERY II -

insert all
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','b',NULL, 'test', 123 , 'N', 'asdf')
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','e',NULL, 'test', 123 , 'N', 'asdf')
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','r',NULL, 'test', 123 , 'N', 'asdf')
into ABC (SSM_ID, invocation_id , calc_id, analytic_id, analytic_value,
override, update_source) values ('c','t',NULL, 'test', 123 , 'N', 'asdf')
select 1 from dual

EXECUTION PLAN by Oracle for QUERY II -

-----------------------------------------------------------------------------
| Id  | Operation           | Name| Rows  | Cost (%CPU)   | Time     |
-----------------------------------------------------------------------------
|   0 | INSERT STATEMENT    |     | 1     |     2   (0)   | 00:00:01 |
|   1 |  MULTI-TABLE INSERT |     |       |               |          |
|   2 |   FAST DUAL         |     | 1     |     2   (0)   | 00:00:01 |
|   3 |   INTO              | ABC |       |               |          |
|   4 |   INTO              | ABC |       |               |          |
|   5 |   INTO              | ABC |       |               |          |
|   6 |   INTO              | ABC |       |               |          |
-----------------------------------------------------------------------------

Based on these experiments, QUERY I was faster.

I tested this both in Oracle SQL Developer and by sending the insert queries from my C++ program (FILELOADER).

On reading further about it, I found that the cost shown in the execution plan is the optimizer's estimate of the relative resources (CPU and I/O) a query will consume. That suggests Oracle will do more work to process the first query, which is why its cost comes out at 8.
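(For reference, plans like these can be generated with EXPLAIN PLAN and DBMS_XPLAN:)

EXPLAIN PLAN FOR
insert into ABC (SSM_ID, invocation_id, calc_id, analytic_id, analytic_value,
override, update_source)
select 'c','b',NULL, 'test', 123, 'N', 'asdf' from dual;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);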

Even when using the same insert pattern via my application, I found that its performance was almost 1.5 times better.

I need some insight on how I can improve the performance even further. I have summarized everything I have tried in this question; if I find or discover anything relevant, I will add it here.

My goal is to bring the upload time for 10 million rows under 10 minutes.

Answer

I know others have mentioned this and you don't want to hear it but use SQL*Loader or external tables. My average load time for tables of approximately the same width is 12.57 seconds for just over 10m rows. These utilities have been explicitly designed to load data into the database quickly and are pretty good at it. This may incur some additional time penalties depending on the format of your input file, but there are quite a few options and I've rarely had to change files prior to loading.
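To illustrate the external-table route, here is a minimal sketch assuming a comma-delimited input file; the directory path and file name below are hypothetical:

CREATE OR REPLACE DIRECTORY load_dir AS '/data/loads';  -- hypothetical path

CREATE TABLE abc_ext (
   ssm_id         VARCHAR2(9),
   invocation_id  VARCHAR2(100),
   calc_id        VARCHAR2(100),
   analytic_id    VARCHAR2(100),
   analytic_value NUMBER,
   override       VARCHAR2(1),
   update_source  VARCHAR2(255)
)
ORGANIZATION EXTERNAL (
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY load_dir
   ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      FIELDS TERMINATED BY ','
   )
   LOCATION ('abc_input.csv')  -- hypothetical file name
);

-- the whole file can then be loaded with one set-based statement,
-- e.g. into a staging table (see ABC_LOAD below)
insert /*+ append */ into abc_load
select * from abc_ext;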

If you're unwilling to do this then you don't have to upgrade your hardware yet; you need to remove every possible impediment to loading this quickly. To enumerate them, remove:

  1. Indexes
  2. Triggers
  3. Sequences
  4. Partitions

With all of these you're obliging the database to perform more work, and because you're doing this transactionally you're not using the database to its full potential. A sketch of stripping them out follows.
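Using the DDL from the question, clearing these out for the duration of the load might look like this (recreate them once the load has finished):

DROP INDEX abc_indx;
DROP TRIGGER abc_insert;
-- after the load, rebuild the unique index in a single pass
CREATE UNIQUE INDEX abc_indx ON abc (seq_no, ssm_id, invocation_id, analytic_id);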

Load the data into a separate table, say ABC_LOAD. After the data has been completely loaded, perform a single INSERT statement into ABC.

insert into abc
select abc_seq.nextval, a.*
  from abc_load a
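For completeness, ABC_LOAD can simply mirror ABC minus the sequence column; one way to sketch it:

CREATE TABLE abc_load AS
SELECT ssm_id, invocation_id, calc_id, analytic_id,
       analytic_value, override, update_source
  FROM abc
 WHERE 1 = 0;  -- copies the column definitions but no rows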

When you do this (and even if you don't) ensure that the sequence cache size is correct; to quote:

When an application accesses a sequence in the sequence cache, the sequence numbers are read quickly. However, if an application accesses a sequence that is not in the cache, then the sequence must be read from disk to the cache before the sequence numbers are used.

If your applications use many sequences concurrently, then your sequence cache might not be large enough to hold all the sequences. In this case, access to sequence numbers might often require disk reads. For fast access to all sequences, be sure your cache has enough entries to hold all the sequences used concurrently by your applications.

This means that if you have 10 threads concurrently writing 500 records each using this sequence, then you need a cache size of 5,000. The ALTER SEQUENCE documentation states how to change this:

alter sequence abc_seq cache 5000

If you follow my suggestion I'd up the cache size to something around 10.5m.
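Applied to this sequence, that would be:

alter sequence abc_seq cache 10500000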

Look into using the APPEND hint (see also Oracle Base); this instructs Oracle to use a direct-path insert, which appends data directly to the end of the table rather than searching for space to put it. You won't be able to use this if your table has indexes, but you could use it on ABC_LOAD:

insert /*+ append */ into ABC (SSM_ID, invocation_id , calc_id, ... )
select 'c','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'a','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'b','b',NULL, 'test', 123 , 'N', 'asdf' from dual
union all select 'c','g',NULL, 'test', 123 , 'N', 'asdf' from dual

If you use the APPEND hint, I'd add a TRUNCATE of ABC_LOAD after you've inserted into ABC; otherwise this table will grow indefinitely. This should be safe, as you will have finished using the table by then.
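That cleanup is a single statement:

truncate table abc_load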

You don't mention which version or edition of Oracle you're using. There are a number of extra little tricks you can use:

  • Oracle 12c

This version supports identity columns; you could get rid of the sequence completely.

CREATE TABLE ABC(
   seq_no         NUMBER GENERATED AS IDENTITY (increment by 5000)
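With an identity column the final load no longer touches a sequence at all; a sketch, assuming the ABC_LOAD staging table from above:

insert /*+ append */ into abc (ssm_id, invocation_id, calc_id, analytic_id,
                               analytic_value, override, update_source)
select ssm_id, invocation_id, calc_id, analytic_id,
       analytic_value, override, update_source
  from abc_load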

  • Oracle 11g r2

    If you keep the trigger, you can assign the sequence value directly:

    :new.seq_no := ABC_seq.nextval;
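    Applied to the original trigger, the rewrite is minimal:

    CREATE OR REPLACE TRIGGER ABC_insert
    BEFORE INSERT ON ABC
    FOR EACH ROW
    BEGIN
       :new.seq_no := ABC_seq.nextval;
    END;
    /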
    

  • Oracle Enterprise Edition

    If you're using Oracle Enterprise you can speed up the INSERT from ABC_LOAD by using the PARALLEL hint:

    insert /*+ parallel */ into abc
    select abc_seq.nextval, a.*
      from abc_load a
    

    This can cause its own problems (too many parallel processes etc.), so test it. It might help for the smaller batch inserts, but it's less likely, as you'll lose time computing which thread should process what.

    Use the utilities that come with the database.

    If you can't use them then get rid of everything that might slow the insert down and do it in bulk, 'cause that's what the database is good at.
