Perl regex large data performance

Question

1. I have a large amount of data that I read from the database (about 10 million records).
2. For each record, I apply about 500 search-and-replace regular expressions that I have.
3. After applying all 500 regular expressions, the record is written to a file, and then the next record is processed.

The performance bottleneck is applying the 500 regular expressions on each and every record fetched from the database.

Here is the relevant code block:

#normalizing the addresses fetched... this may take awhile
    while(my @row = $queryHandle->fetchrow_array())
    {
        #extract data from record
        $accountKey = $row[0];
        $addressLine1 = $row[1];
        $addressLine2 = $row[2];

        #iterate through all the regular expressions I have stored (about 500)
        for my $regexRef (@regexesList)
        {
            #get regular expression hash object
            my %regexObj = %{$regexRef};
            my $regexPattern = $regexObj{pattern}; #the regex pattern
            my $regexOutput = $regexObj{output}; #the replacement string

            #first remove all special characters leaving only numbers and alphabets
            $addressLine1 =~ s/[^A-Za-z0-9 ]//g;
            $addressLine2 =~ s/[^A-Za-z0-9 ]//g;

            #now standardize the addresses
            $addressLine1 =~ s/$regexPattern/$regexOutput/ig;
            $addressLine2 =~ s/$regexPattern/$regexOutput/ig;
        }

        my $normalizedAddress = lc($addressLine1 . $addressLine2);
        $normalizedAddress =~ s/\s+//g; #remove all white space

        print $dataFileHandle "${normalizedAddress}\n";
        $rowCount++;
    }

This is working code, but the performance is abysmal. Currently the script has been running for 2.5 hours and has written 3.13 million records to the output file, with about 7 million to go, haha.

Is this the best it can get? Is there a faster, or at least less slow, way? Maybe writing each row to a file first and then running each regular expression over the whole file?

I would like to know if there is a better way to implement this before I try the above-mentioned alternative.

Thanks everyone!

Answer

You're reparsing your 500-600 regular expressions each time, and that takes time.

    $addressLine1 =~ s/$regexPattern/$regexOutput/ig; # Interpolate and reparse
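One cheaper mitigation, not part of this answer's compiled-subroutine approach and shown only as a hedged sketch, is to precompile each pattern once with qr//; Perl then reuses the compiled regex instead of reparsing the interpolated string on every record (the replacement side is still interpolated each time):

    # Hedged sketch: precompile the patterns once, outside the row loop
    my @compiled = map {
        { pattern => qr/$_->{pattern}/i, output => $_->{output} }
    } @regexesList;

    # ...then, inside the per-record loop:
    for my $re (@compiled) {
        # the /i flag is already baked into the qr// object above
        $addressLine1 =~ s/$re->{pattern}/$re->{output}/g;
        $addressLine2 =~ s/$re->{pattern}/$re->{output}/g;
    }

The anonymous-subroutine approach below goes further by eliminating the per-substitution hash lookups and interpolation entirely.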

The following is a proof of concept that builds an anonymous subroutine that includes your regular expressions in literal code, instead of being interpreted from variables each time.

This demonstrates a 10x improvement in performance.

use strict;
use warnings;

use Benchmark;

my @regexesList = map {{pattern => "foo$_", output => "bar$_"}} (1..600);

my $string1 = 'a' x 100;
my $string2 = 'b' x 100;

# Original code
sub original {
    my ($regexesList, $addressLine1, $addressLine2) = @_;

    #iterate through all the regular expressions I have stored (about 500)
    for my $regexRef (@$regexesList) {
        #get regular expression hash object
        my %regexObj = %{$regexRef};
        my $regexPattern = $regexObj{pattern}; #the regex pattern
        my $regexOutput = $regexObj{output}; #the replacement string

        #now standardize the addresses
        $addressLine1 =~ s/$regexPattern/$regexOutput/ig;
        $addressLine2 =~ s/$regexPattern/$regexOutput/ig;
    }

    my $normalizedAddress = lc($addressLine1 . $addressLine2);
    $normalizedAddress =~ s{\s+}{}g; #remove all white space

    return $normalizedAddress;
}

# Build an anonymous subroutine to do all of the regex translations:
my $regex_code = "s/\\s+//g;\n";
for (@regexesList) {
    $regex_code .= "s/$_->{pattern}/$_->{output}/ig;\n";
}
my $code = <<"END_CODE";
    sub {
        my \@address = \@_;
        for (\@address) {
            $regex_code
        }
        return lc join '', \@address;
     }
END_CODE
my $address_sub = eval $code;
if ($@) {
    die "Invalid code $code: $@";
}

# Benchmark these two calling methods:
timethese(10000, {
    'original' => sub { original(\@regexesList, $string1, $string2) },
    'cached'   => sub { $address_sub->($string1, $string2) },
});

Output:

Benchmark: timing 10000 iterations of cached, original...
    cached:  4 wallclock secs ( 4.23 usr +  0.00 sys =  4.23 CPU) @ 2365.74/s (n=10000)
  original: 47 wallclock secs (47.18 usr +  0.00 sys = 47.18 CPU) @ 211.98/s (n=10000)

Additionally, you were needlessly applying this regex s/[^A-Za-z0-9 ]//g; for each iteration of your loop. That was unnecessary, and could've been applied outside the loop.
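For completeness, here is a rough sketch of how the pre-built $address_sub could slot into the question's fetch loop, with that stripping done once per record. This wiring is my assumption, not part of the original answer; variable names follow the question's code:

    while (my @row = $queryHandle->fetchrow_array())
    {
        my ($accountKey, $addressLine1, $addressLine2) = @row;

        # strip special characters once per record, not once per pattern
        $addressLine1 =~ s/[^A-Za-z0-9 ]//g;
        $addressLine2 =~ s/[^A-Za-z0-9 ]//g;

        # the generated sub removes whitespace, applies all ~500
        # substitutions, lowercases, and joins the two lines
        my $normalizedAddress = $address_sub->($addressLine1, $addressLine2);

        print $dataFileHandle "${normalizedAddress}\n";
        $rowCount++;
    }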

There are likely other improvements that can be made, but you'll have to utilize Benchmarking yourself to find them, as that's not really the purpose of SO.
