Is it faster to use alternation than subsequent replacements in regular expressions?


Question

I have quite a straightforward question. Where I work I see a lot of regular expressions come by. They are used in Perl to replace and/or get rid of some strings in text, e.g.:

$string=~s/^.+\///;
$string=~s/\.shtml//;
$string=~s/^ph//;

I understand that you cannot combine the first and last substitutions, because you may only want to remove ph at the beginning of the string after the first substitution has run. However, I would put the first and second regexes together using alternation: $string=~s/(^.+\/|\.shtml)//; Because we're processing thousands of files (500,000+), I was wondering which method is the most efficient.

Answer

Your expressions are not equivalent

This:

$string=~s/^.+\///;
$string=~s/\.shtml//;

replaces the text .shtml and everything up to and including the last slash.

This:

$string=~s/(^.+\/|\.shtml)//;

replaces either everything up to and including the last slash or the text .shtml, but not both: without the /g modifier, s/// stops after the first match.
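
To make the difference concrete, here is a minimal sketch that runs both forms on the same input. The sample path and the variable names are made up for illustration; they are not from the question.

use strict;
use warnings;

# Hypothetical sample input, chosen so that both patterns can match.
my $path = 'docs/archive/phreport.shtml';

# Two separate substitutions: each one gets its own chance to match.
(my $sequential = $path) =~ s/^.+\///;
$sequential =~ s/\.shtml//;

# One substitution with alternation: only the first match is replaced.
(my $combined = $path) =~ s/(^.+\/|\.shtml)//;

print "sequential: $sequential\n";   # sequential: phreport
print "combined:   $combined\n";     # combined:   phreport.shtml

Adding /g to the combined form would let it match more than once, but it would then also strip every later occurrence of .shtml rather than just the first, so it still would not be the same operation.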

This is one problem with combining regexes: a single complex regex is harder to write, harder to understand, and harder to debug than several simple ones.

Even if your expressions were equivalent, using one or the other probably wouldn't have a significant impact on your program's speed. In-memory operations like s/// are much faster than file I/O, and you've indicated that you're doing a lot of file I/O.

You should profile your application with something like Devel::NYTProf to see if these particular substitutions are actually a bottleneck (I doubt they are). Don't waste your time optimizing things that are already fast.

Keep in mind that you're comparing apples and oranges, but if you're still curious about performance, you can see how perl evaluates a particular regex using the re pragma:

$ perl -Mre=debug -e'$_ = "foobar"; s/^.+\///; s/\.shtml//;'
...
Guessing start of match in sv for REx "^.+/" against "foobar"
Did not find floating substr "/"...
Match rejected by optimizer
Guessing start of match in sv for REx "\.shtml" against "foobar"
Did not find anchored substr ".shtml"...
Match rejected by optimizer
Freeing REx: "^.+/"
Freeing REx: "\.shtml"

The regex engine has an optimizer. The optimizer searches for substrings that must appear in the target string; if these substrings can't be found, the match fails immediately, without checking the other parts of the regex.

With /^.+\//, the optimizer knows that $string must contain at least one slash in order to match; when it finds no slashes, it rejects the match immediately without invoking the full regex engine. A similar optimization occurs with /\.shtml/.

Here's what perl does with the combined regex:

$ perl -Mre=debug -e'$_ = "foobar"; s/(?:^.+\/|\.shtml)//;'
...
Matching REx "(?:^.+/|\.shtml)" against "foobar"
   0 <> <foobar>             |  1:BRANCH(7)
   0 <> <foobar>             |  2:  BOL(3)
   0 <> <foobar>             |  3:  PLUS(5)
                                    REG_ANY can match 6 times out of 2147483647...
                                    failed...
   0 <> <foobar>             |  7:BRANCH(11)
   0 <> <foobar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   1 <f> <oobar>             |  1:BRANCH(7)
   1 <f> <oobar>             |  2:  BOL(3)
                                    failed...
   1 <f> <oobar>             |  7:BRANCH(11)
   1 <f> <oobar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   2 <fo> <obar>             |  1:BRANCH(7)
   2 <fo> <obar>             |  2:  BOL(3)
                                    failed...
   2 <fo> <obar>             |  7:BRANCH(11)
   2 <fo> <obar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   3 <foo> <bar>             |  1:BRANCH(7)
   3 <foo> <bar>             |  2:  BOL(3)
                                    failed...
   3 <foo> <bar>             |  7:BRANCH(11)
   3 <foo> <bar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   4 <foob> <ar>             |  1:BRANCH(7)
   4 <foob> <ar>             |  2:  BOL(3)
                                    failed...
   4 <foob> <ar>             |  7:BRANCH(11)
   4 <foob> <ar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   5 <fooba> <r>             |  1:BRANCH(7)
   5 <fooba> <r>             |  2:  BOL(3)
                                    failed...
   5 <fooba> <r>             |  7:BRANCH(11)
   5 <fooba> <r>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
Match failed
Freeing REx: "(?:^.+/|\.shtml)"

Notice how much longer the output is. Because of the alternation, the optimizer doesn't kick in and the full regex engine is executed. In the worst case (no matches), each branch of the alternation is tried at every position in the string. This is not very efficient.

So, alternations are slower, right? No, because...

Again, we're comparing apples and oranges, but with:

$string = 'a/really_long_string';

the combined regex may actually be faster because with s/\.shtml//, the optimizer has to scan most of the string before rejecting the match, while the combined regex matches quickly.

You can benchmark this for fun, but it's essentially meaningless since you're comparing different things.
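
If you do want to try it, a rough sketch using the core Benchmark module might look like the following. The sub names and the sample strings are assumptions made for illustration; substitute strings that resemble your real data, and remember that the two subs do not produce the same results.

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample inputs, not taken from the question.
my @samples = ('a/really_long_string', 'foobar', 'docs/archive/phreport.shtml');

# The two separate substitutions from the question.
sub sequential {
    my ($s) = @_;
    $s =~ s/^.+\///;
    $s =~ s/\.shtml//;
    return $s;
}

# The single substitution with alternation.
sub combined {
    my ($s) = @_;
    $s =~ s/(?:^.+\/|\.shtml)//;
    return $s;
}

# A negative count tells Benchmark to run each sub for at least
# that many CPU seconds instead of a fixed number of iterations.
cmpthese(-1, {
    sequential => sub { sequential($_) for @samples },
    combined   => sub { combined($_)   for @samples },
});

cmpthese prints a table of iteration rates and relative percentages. The numbers depend heavily on the input strings and the perl version, so treat them as a curiosity, not as guidance.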
