REGEXP_SUBSTR与INSTR和SUBSTR的性能和可读性 [英] Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

查看:137
本文介绍了REGEXP_SUBSTR与INSTR和SUBSTR的性能和可读性的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

从我的另一个问题中,将REGEXP_SUBSTR与Strings Qualifier一起使用,我正在尝试确定哪种方法更适合使用.

From My other question, Using REGEXP_SUBSTR with Strings Qualifier, I'm trying to decide which approach would be better to use.

结果数据集应仅按正确的顺序在除斜符PLE#ALL之前显示字符串. 程序包中已经存在的当前查询"是这样的(DDL和DML在帖子的底部):

The Resulting Data set should only show the strings before the delimters PLE, # and ALL in proper order. The Current Query already in the package is something like this (DDL and DML is in the Bottom of the Post):

SELECT  DATA1
      , DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1)) GET_DATA_TILL_FIRST_PLE
      , DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1)) GET_DATA_TILL_FIRST_NUM_SIGN
      , DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1)) GET_DATA_TILL_FIRST_ALL
      , NVL(DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1), 0,
        DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1), 0,
        DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1), DATA1), DATA1), DATA1), DATA1) PUT_THEM_ALL_TOGETHER    
FROM    table_x;    

将哪些结果存入以下数据集:

Which Results into a Data Set below:

DATA1                   | GET_DATA_TILL_FIRST_PLE | GET_DATA_TILL_FIRST_#_SIGN  | GET_DATA_TILL_FIRST_ALL    |  PUT_THEM_ALL_TOGETHER
----------------------- | ----------------------- | --------------------------- | -------------------------- |  ----------------------
STRING_EXAMPLE          | STRING_EXAM             |                             |                            |  STRING_EXAM
TREE_OF_APPLES          | TREE_OF_AP              |                             |                            |  TREE_OF_AP
FIRST_EXAMPLE           | FIRST_EXAM              |                             |                            |  FIRST_EXAM
IMPLEMENTATION          | IM                      |                             |                            |  IM
PARIS                   |                         |                             |                            |  PARIS
PLEONASM                |                         |                             |                            |  PLEONASM
XXXX 1                  |                         |                             |                            |  XXXX 1 
XXXX YYYYYY 2 FFFFFFFFF |                         |                             |                            |  XXXX YYYYYY 2 FFFFFFFFF
XXXX YYYYYY 5FFFFFFFFF  |                         |                             |                            |  XXXX YYYYYY 5FFFFFFFFF
OPOPOPOPO #09090 APPLE  | OPOPOPOPO #09090 AP     | OPOPOPOPO                   | OPOPOPOPO #                |  OPOPOPOPO #09090 AP
OPOPOPOPO BALL#         |                         | OPOPOPOPO BALL              | OPOPOPOPO B                |  OPOPOPOPO BALL
BALL IS #LIFE           |                         | BALL IS                     | B                          |  BALL IS     

PS.我只需要列PUT_THEM_ALL_TOGETHER,但我也包括其他列,并且确实添加了上下文.

PS. I only need column PUT_THEM_ALL_TOGETHER but i included the other columns as well do add context.

我发现查询有点混乱并且难以阅读,因此我尝试使用REGEXP_SUBSTR并在@vkp的建议下,提出了以下查询 会得出与上述相同的数据集.

I find the query bit confusing and hard to read so i tried using REGEXP_SUBSTR and with the suggestion of @vkp, i came up with the below query that results into the same data set above.

SELECT  DATA1
  , REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1) GET_DATA_TILL_FIRST_PLE
  , REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1) GET_DATA_TILL_FIRST_#_SIGN
  , REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1) GET_DATA_TILL_FIRST_ALL
  , COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1),
             REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1),
             REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1),
             DATA1) PUT_THEM_ALL_TOGETHER
FROM    table_x;     

但是,从@MathGuy的答案来看,INSTRSUBSTR似乎更有效率. 我对此进行了某种程度的测试,这是我得到的:

However, From @MathGuy's Answer, it seems that INSTR and SUBSTR is much more efficient. i tested this to some extent and here's what i got:

使用INSTRSUBSTR:

SET TIMING ON;    
BEGIN
    UPDATE  table_x
    SET     DATA2 = NVL(DECODE(SIGN(0 - instr(DATA1, 'PLE')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'PLE') - 1), 0,
                    DECODE(SIGN(0 - instr(DATA1, '#')), -1, SUBSTR(DATA1, 1, instr(DATA1, '#') - 1), 0,
                    DECODE(SIGN(0 - instr(DATA1, 'ALL')), -1, SUBSTR(DATA1, 1, instr(DATA1, 'ALL') - 1), DATA1), DATA1), DATA1), DATA1);    
    ROLLBACK;        
END;
/            

PL/SQL过程成功完成.
过去时间:00:00:00.234

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.234

使用REGEXP_SUBSTR:

SET TIMING ON;  
BEGIN    
    UPDATE  table_x
    SET     DATA2 = COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1);
    ROLLBACK;        
END;
/    

PL/SQL过程成功完成.
过去时间:00:00:00.236

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.236

尽管这是非常有限的测试数据,但它表明INSTRSUBSTR的组合比REGEXP_SUBSTR快一点. 出于可读性考虑,使用REGEXP_SUBSTR代替INSTRSUBSTR可以忽略不计吗?

Although this is a very limited test data, it shows that the combination of INSTR and SUBSTR is a little bit faster than REGEXP_SUBSTR. Would it be negligible to use REGEXP_SUBSTR instead of INSTR and SUBSTR for readability's sake?

DML和DDL:

create table table_x 
(
    data1 varchar2(100)    
   ,data2 varchar2(100)
);

INSERT INTO table_x (DATA1) VALUES ('STRING_EXAMPLE');
INSERT INTO table_x (DATA1) VALUES ('TREE_OF_APPLES');
INSERT INTO table_x (DATA1) VALUES ('FIRST_EXAMPLE');  
INSERT INTO table_x (DATA1) VALUES ('IMPLEMENTATION');   
INSERT INTO table_x (DATA1) VALUES ('PARIS');            
INSERT INTO table_x (DATA1) VALUES ('PLEONASM');        

INSERT INTO table_x (DATA1) VALUES ('XXXX 1');   
INSERT INTO table_x (DATA1) VALUES ('XXXX YYYYYY 2 FFFFFFFFF'); 
INSERT INTO table_x (DATA1) VALUES ('XXXX YYYYYY 5FFFFFFFFF'); 

INSERT INTO table_x (DATA1) VALUES ('OPOPOPOPO #09090 APPLE'); 
INSERT INTO table_x (DATA1) VALUES ('OPOPOPOPO BALL#'); 
INSERT INTO table_x (DATA1) VALUES ('BALL IS #LIFE');   

谢谢.

推荐答案

我已经发布了答案,显示了如何使用INSTRSUBSTR正确的方法来解决此问题.

I already posted an answer showing how to solve this problem using INSTR and SUBSTR the right way.

在此答案"中,我解决了另一个问题-哪种解决方案更有效.我将在下面解释测试,但这是最重要的一点:REGEXP解决方案所花费的时间比INSTR/SUBSTR解决方案长40倍.

In this "Answer" I address the other question - which solution is more efficient. I will explain the test below, but here is the bottom line: the REGEXP solution takes 40 times longer than the INSTR/SUBSTR solution.

设置:我创建了一个包含150万个随机字符串的表(长度均为8个字符,全部为大写字母).然后,我修改了10%的字符串以添加子字符串'PLE',另外10%的字符串添加了'#',另外10%的字符串添加了'ALL'.我是通过在位置mod(rownum, 9)处拆分原始字符串(即介于0和8之间的数字)并在该位置处连接'PLE''#''ALL'来实现的.当然,这不是获取所需测试数据的最有效或最优雅的方法,但这无关紧要-关键是创建测试数据并将其用于我们的测试中.

Setup: I created a table with 1.5 million random strings (all exactly eight characters long, all upper-case letters). Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

所以:我们现在有一个只有一列data1的表,在150万行中有一些随机字符串.每个都有10%的子字符串包含PLE#ALL.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them.

测试包括像原始文章中那样创建新的字符串data2.我没有将结果插入表中;而是将结果插入表中.不管data2的计算方式如何,插入到表中的时间都应该相同.

The test consists in creating the new string data2 as in the original post. I am not inserting the result back in the table; regardless of how data2 is calculated, the time to insert it back in the table should be the same.

相反,我将主查询放入一个外部查询中,该查询计算得出的data2值的长度之和.这样,我保证优化器不能采用快捷方式:必须生成所有data2值,必须测量其长度,然后将其求和.

Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

下面是创建基表所需的语句,我将其称为table_z,然后运行查询.

Below are the statements needed to create the base table, which I called table_z, then the queries I ran.

create table table_z as
select dbms_random.string('U', 8) as data1 from dual
connect by level <= 1500000;

update table_z 
set data1 = case
when rownum between      1 and 150000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'PLE' || substr(data1, mod(rownum, 9) + 1)
when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9)) 
                               || '#'   || substr(data1, mod(rownum, 9) + 1)
when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'ALL' || substr(data1, mod(rownum, 9) + 1)
          end
where rownum <= 450000;

commit;

INSTR/SUBSTR解决方案

INSTR/SUBSTR solution

select sum(length(data2))
from (
select data1, 
       case 
         when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
         when instr(data1, '#'  , 2) > 0 then substr(data1, 1, instr(data1, '#'  , 2) - 1)
         when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
         else data1 end
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:00.73

REGEXP解决方案

REGEXP solution

select sum(length(data2))
from (
select data1, 
       COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1)
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:30.75

在任何人提出这些建议之前,我都重复了两次查询;第一个解决方案的运行时间通常为0.75到0.80秒,第二个查询的运行时间为30到35秒.慢40倍以上. (因此,编译器/优化器花费时间来编译查询不是问题;这实际上是执行时间.)而且,这与从基表中读取150万个值无关,这与两种测试的时间都比处理要少得多.无论如何,我都会先运行INSTR/SUBSTR查询,因此,如果有任何缓存,那么REGEXP查询将是一个受益的地方.

Before anyone suggests these things: I repeated both queries several times; the first solution always runs in 0.75 to 0.80 seconds, the second query runs in 30 to 35 seconds. More than 40 times slower. (So it is not a matter of the compiler/optimizer spending time to compile the query; it is really the execution time.) Also, this has nothing to do with reading the 1.5 million values from the base table - that is the same in both tests, and it takes far less time than the processing. In any case, I ran the INSTR/SUBSTR query first, so if there was any caching, the REGEXP query would have been the one to benefit.

编辑:我刚刚发现所建议的REGEXP解决方案效率低下.如果我们将搜索模式锚定到字符串的开头(例如'^(.+?)PLE',请注意^锚定),则REGEXP查询的运行时间将从30秒降至10秒.显然,Oracle实现不足以识别这种等效性,并尝试从第二个字符,第三个字符等进行搜索.执行时间仍然要长15倍; 15< 40,但仍然相差很大.

Edit: I just figured out one inefficiency in the proposed REGEXP solution. If we anchor the search pattern to the beginning of the string (for example '^(.+?)PLE', notice the ^ anchor), the runtime for the REGEXP query drops from 30 seconds to 10 seconds. Apparently the Oracle implementation isn't smart enough to recognize this equivalence and tries searches from the second character, from the third, etc. Still the execution time is almost 15 times longer; 15 < 40 but that is still a very large difference.

这篇关于REGEXP_SUBSTR与INSTR和SUBSTR的性能和可读性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆