How can I efficiently match many different regex patterns in Perl?
Question
I have a growing list of regular expressions that I am using to parse through log files, searching for "interesting" error and debug statements. I'm currently breaking them into 5 buckets, with most of them falling into 3 large buckets. I have over 140 patterns so far, and the list is continuing to grow.
Most of the regular expressions are simple, but they're also fairly unique, so my opportunities to catch multiple matches with a single pattern are few and far between. Because of the nature of what I'm matching, the patterns tend to be obscure and therefore seldom match, so I'm doing a TON of work on each input line, with the end result being that it fails to match anything, or matches one of the generic ones at the very end.
And because of the quantity of input (hundreds of megabytes of log files) I'm sometimes waiting for a minute or two for the script to finish. Hence my desire for a more efficient solution. I'm not interested in sacrificing clarity for speed, though.
I currently have the regular expressions set up like this:
if (($line =~ m{Failed in routing out}) ||
($line =~ m{Agent .+ failed}) ||
($line =~ m{Record Not Exist in DB}) ||
...
Is there a better way of structuring this so it's more efficient, yet still maintainable? Thanks!
Recommended answer
You might want to take a look at Regexp::Assemble. It's intended to handle exactly this sort of problem.
Boosted code from the module's synopsis:
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])
You can even slurp your regex collection out of a separate file.
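For comparison, here is a minimal sketch of the same idea using only core Perl, in case installing a module is not an option. It is not Regexp::Assemble and does none of its pattern merging: it simply joins the patterns into one alternation and compiles it once with qr//. The three patterns are taken from the question; the @patterns array, the interesting_lines helper, and the sample log lines are made up for illustration. In practice the list could be read from a separate file, one pattern per line.

```perl
use strict;
use warnings;

# Hypothetical pattern list (the three patterns shown in the question).
# These could just as well be read from a file, one per line.
my @patterns = (
    'Failed in routing out',
    'Agent .+ failed',
    'Record Not Exist in DB',
);

# Wrap each pattern in (?:...) so its operators stay contained,
# join everything into one alternation, and compile it once.
my $alternation = join '|', map { "(?:$_)" } @patterns;
my $interesting = qr/$alternation/;

# Hypothetical helper: return the lines that match at least one pattern.
sub interesting_lines {
    my @lines = @_;
    return grep { $_ =~ $interesting } @lines;
}

# Made-up sample input, only to show the call.
my @sample = (
    'INFO startup complete',
    'ERROR Agent 42 failed',
    'WARN Record Not Exist in DB',
);
my @hits = interesting_lines(@sample);
print "$_\n" for @hits;
```

Keeping the patterns in a plain list means adding a new one is a one-line change, which helps with the maintainability concern in the question. Whether a single alternation is actually faster than the chained matches depends on the patterns and the Perl version, so it is worth benchmarking on real log data.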