perl regex large data performance


Question

1.) I have a large amount of data that I read from the database (about 10 million records).
2.) For each record, I search and replace using about 500 regular expressions.
3.) After applying all 500 regular expressions, the record is written to a file and the next record is processed.

The performance bottleneck is applying the 500 regular expressions to each and every record fetched from the database.

Here is the relevant block of code:

#normalizing the addresses fetched... this may take awhile
    while(my @row = $queryHandle->fetchrow_array())
    {
        #extract data from record
        $accountKey = $row[0];
        $addressLine1 = $row[1];
        $addressLine2 = $row[2];

        #iterate through all the regular expressions I have stored (about 500)
        for my $regexRef (@regexesList)
        {
            #get regular expression hash object
            my %regexObj = %{$regexRef};
            my $regexPattern = $regexObj{pattern}; #the regex pattern
            my $regexOutput = $regexObj{output}; #the replacement string

            #first remove all special characters leaving only numbers and alphabets
            $addressLine1 =~ s/[^A-Za-z0-9 ]//g;
            $addressLine2 =~ s/[^A-Za-z0-9 ]//g;

            #now standardize the addresses
            $addressLine1 =~ s/$regexPattern/$regexOutput/ig;
            $addressLine2 =~ s/$regexPattern/$regexOutput/ig;
        }

        my $normalizedAddress = lc($addressLine1 . $addressLine2);
        $normalizedAddress =~ s/\s+//g; #remove all white space

        print $dataFileHandle "${normalizedAddress}\n";
        $rowCount++;
    }

This is working code, but the performance is abysmal. Currently the script has been running for 2.5 hours and has written 3.13 million records to the output file, with about 7 million to go haha.

Is this the best it can get? Is there another, faster way? Maybe writing each row to a file first and then running each regular expression over the whole file?

I would like to know if there is a better way to implement this before I try the above-mentioned alternative.

Answer

You're reparsing your 500-600 regular expressions for every record, and that takes time.

    $addressLine1 =~ s/$regexPattern/$regexOutput/ig; # Interpolate and reparse
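A lighter-weight mitigation, not part of this answer but worth knowing, is to precompile each pattern once with qr// before the record loop, so Perl parses every pattern exactly once. This is a minimal sketch with made-up example patterns (foo1, foo2, ...) standing in for the real address rules:

```perl
use strict;
use warnings;

# Hypothetical list mirroring the question's {pattern, output} hashes.
my @regexesList = map { { pattern => "foo$_", output => "bar$_" } } (1 .. 3);

# Compile each pattern ONCE, up front; /i is baked into the qr// object.
my @compiled = map { [ qr/$_->{pattern}/i, $_->{output} ] } @regexesList;

my $address = 'xFOO1y foo2z';
for my $pair (@compiled) {
    my ($re, $out) = @$pair;
    $address =~ s/$re/$out/g;    # no reparse: $re is a compiled regex
}
print "$address\n";              # -> xbar1y bar2z
```

Interpolating a qr// object into s/// carries its flags along, so the /i from compilation still applies even though the substitution itself only uses /g.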

The following is a proof of concept that builds an anonymous subroutine which includes your regular expressions as literal code, instead of interpolating them from variables each time. It demonstrates a 10x performance improvement:

use strict;
use warnings;

use Benchmark;

my @regexesList = map {{pattern => "foo$_", output => "bar$_"}} (1..600);

my $string1 = 'a' x 100;
my $string2 = 'b' x 100;

# Original code
sub original {
    my ($regexesList, $addressLine1, $addressLine2) = @_;

    #iterate through all the regular expressions I have stored (about 500)
    for my $regexRef (@regexesList) {
        #get regular expression hash object
        my %regexObj = %{$regexRef};
        my $regexPattern = $regexObj{pattern}; #the regex pattern
        my $regexOutput = $regexObj{output}; #the replacement string

        #now standardize the addresses
        $addressLine1 =~ s/$regexPattern/$regexOutput/ig;
        $addressLine2 =~ s/$regexPattern/$regexOutput/ig;
    }

    my $normalizedAddress = lc($addressLine1 . $addressLine2);
    $normalizedAddress =~ s{\s+}{}g; #remove all white space

    return $normalizedAddress;
}

# Build an anonymous subroutine to do all of the regex translations:
my $regex_code = "s/\\s+//g;\n";
for (@regexesList) {
    $regex_code .= "s/$_->{pattern}/$_->{output}/ig;\n";
}
my $code = <<"END_CODE";
    sub {
        my \@address = \@_;
        for (\@address) {
            $regex_code
        }
        return lc join '', \@address;
     }
END_CODE
my $address_sub = eval $code;
if ($@) {
    die "Invalid code $code: $@";
}

# Benchmark these two calling methods:
timethese(10000, {
    'original' => sub { original(\@regexesList, $string1, $string2) },
    'cached'   => sub { $address_sub->($string1, $string2) },
});

Output:

Benchmark: timing 10000 iterations of cached, original...
    cached:  4 wallclock secs ( 4.23 usr +  0.00 sys =  4.23 CPU) @ 2365.74/s (n=10000)
  original: 47 wallclock secs (47.18 usr +  0.00 sys = 47.18 CPU) @ 211.98/s (n=10000)
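To make the eval'd string concrete: for two example patterns (foo1, foo2 — stand-ins, not real address rules), the generated $code amounts to an ordinary subroutine with the substitutions written out literally, which is why Perl compiles each pattern exactly once:

```perl
use strict;
use warnings;

# What the eval'd $code string boils down to for two example patterns.
my $address_sub = sub {
    my @address = @_;
    for (@address) {
        s/\s+//g;          # strip whitespace, as in the generated code
        s/foo1/bar1/ig;    # each pattern is literal source, compiled once
        s/foo2/bar2/ig;
    }
    return lc join '', @address;
};

print $address_sub->('Foo1 St', 'FOO2 Ave'), "\n";   # -> bar1stbar2ave
```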

Additionally, you were needlessly applying the regex s/[^A-Za-z0-9 ]//g; on each iteration of your loop. That was unnecessary, and could have been applied outside the loop.
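Hoisting that cleanup out of the loop looks like this — a sketch of the question's loop body with hypothetical sample data, stripping special characters once per record instead of 500 times:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the question's 500 stored regexes.
my @regexesList = map { { pattern => "foo$_", output => "bar$_" } } (1 .. 3);

my $addressLine1 = 'foo1 St.#';
my $addressLine2 = 'foo2 Ave!';

# Strip special characters ONCE per record, before the regex loop.
$addressLine1 =~ s/[^A-Za-z0-9 ]//g;
$addressLine2 =~ s/[^A-Za-z0-9 ]//g;

# Now apply only the standardization regexes inside the loop.
for my $regexRef (@regexesList) {
    $addressLine1 =~ s/$regexRef->{pattern}/$regexRef->{output}/ig;
    $addressLine2 =~ s/$regexRef->{pattern}/$regexRef->{output}/ig;
}

my $normalizedAddress = lc($addressLine1 . $addressLine2);
$normalizedAddress =~ s/\s+//g;    # remove all whitespace
print "$normalizedAddress\n";      # -> bar1stbar2ave
```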

There are likely other improvements that can be made, but you'll have to do the benchmarking yourself to find them, as that's not really the purpose of SO.
