Standard C Library regex performance issue


Problem description


I have a small utility program written in Python which works pretty
slowly, so I've decided to implement it in C.
I did some benchmarking of the Python code's performance. One of the parts
of the program uses Python's standard re (regular expressions)
module to parse the input file. As Python's routines to read from the
file and regular expressions are most likely implemented via native
libraries, I would expect that the C code, which reads and parses the
file using exactly the same scheme, would show approximately the same
performance (or maybe better).
I was surprised to find out that the code in C works way slower
(actually about 300 times slower!!!) than the same code in Python.
I am running it all under Unix (I guess the version should not really
matter) and am using gcc with -O2 to compile the C code.

The code does exactly the same in both languages:

1. includes the regexp library (module re in Python, regex.h in C)
2. compiles the expression (the same expression is used, with small
differences caused by the slightly different syntax accepted by the
libraries)
3. reads the input file line by line
4. parses each line using the compiled regexp (in Python it's the call
of .match(..), in C it's the call of regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?

Recommended answer

ig*********@gmail.com wrote:

I was surprised to find out that the code in C works way slower
(actually about 300 times slower!!!) than the same code in Python.
(...)
The code does exactly the same in both languages:

1. includes the regexp library (module re in Python, regex.h in C)




These probably use different regexp implementations with quite different
semantics. Maybe you didn't translate your Python regexps to the
equivalent regex.h regexps. Or maybe you use some very inefficient
regexps which Python re manages to optimize but regex.h does not.
One can write _very_ slow regexps with great ease.
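
To make the "different semantics" point concrete, here is a minimal sketch of how the same pattern text can mean different things depending on the flags passed to regcomp(). The helper name `matches` is hypothetical and not from the thread. It relies on the POSIX rule that `+` is an ordinary character in basic syntax and a repetition operator only when REG_EXTENDED is given, whereas Python's re always treats `+` as a repetition operator.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` matches `pattern`, 0 if it
 * does not, and -1 if the pattern fails to compile. `cflags` is passed
 * on to regcomp(); REG_NOSUB is added because no captures are needed. */
int matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, `matches("^a+$", "aaa", REG_EXTENDED)` reports a match, while `matches("^a+$", "aaa", 0)` does not, because in basic syntax the pattern only matches the literal two-character string "a+". A translation slip of this kind changes what the expression matches, so it is worth ruling out before comparing timings.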


2. compiles the expression (the same expression is used, with small
differences caused by the slightly different syntax accepted by the
libraries)




Hopefully you do this just once, outside the loop?


3. reads input file line by line




How? One fgets() into a char buffer[], or something more clever?
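
For reference, the scheme being asked about (compile once outside the loop, one fgets() per line into a char buffer, one regexec() per line) might look like the sketch below. This is not the original poster's code, which was never posted. The function name `count_matching_lines`, the 4096-byte buffer size and the use of REG_NOSUB are all assumptions made for illustration.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical sketch: counts the fgets() reads from `fp` that match
 * `pattern` (POSIX extended syntax). Returns -1 if the pattern does
 * not compile. */
long count_matching_lines(FILE *fp, const char *pattern)
{
    regex_t re;
    char line[4096];   /* assumed upper bound on line length */
    long count = 0;

    /* Compile exactly once, before the loop. REG_NOSUB tells the
     * library that capture positions are not needed. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* One fgets() per iteration; a line longer than the buffer would
     * arrive in several pieces and be tested piece by piece. */
    while (fgets(line, sizeof line, fp) != NULL) {
        if (regexec(&re, line, 0, NULL, 0) == 0)
            count++;
    }

    regfree(&re);
    return count;
}
```

Note that regexec() searches for a match anywhere in the string, while Python's .match() only matches at the start, so a pattern needs a leading `^` to behave like the Python call. If regcomp() were moved inside the loop, the expression would be recompiled for every line, which is the mistake the question above is probing for.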


4. parses each line using the compiled regexp (in Python it's the call
of .match(..), in C it's the call of regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?




Might also be related to something quite else in your code, which you
haven't mentioned. Hard to guess when you don't post your code.

--
Hallvard


ig*********@gmail.com wrote:

The code does exactly the same in both languages:

1. includes the regexp library (module re in Python, regex.h in C)
2. compiles the expression (the same expression is used, with small
differences caused by the slightly different syntax accepted by the
libraries)
3. reads the input file line by line
4. parses each line using the compiled regexp (in Python it's the call
of .match(..), in C it's the call of regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?




See if Python was compiled to use PCRE or not. If so, then your version using
regcomp and regexec is not the same. This is comp.unix.programmer territory, btw.


On 16 ??×, 14:57, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
wrote:

igor.kul...@gmail.com writes:

I was surprised to find out that the code in C works way slower
(actually about 300 times slower!!!) than the same code in Python.




These probably use different regexp implementations with quite different
semantics. Maybe you didn't translate your Python regexps to the
equivalent regex.h regexps.





I've actually tried running the regexps, adjusted for regex.h syntax,
through regex.h, and they match exactly what I want them to match.


Or maybe you use some very inefficient
regexps which Python re manages to optimize but regex.h does not.
One can write _very_ slow regexps with great ease.




That might be true. Still, the regexp, in spite of being very long, should be
very straightforward.
Here is the regexp (I would understand if no one reads it):

^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?
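
As a rough way to exercise the expression quoted above outside the full program, the sketch below compiles it with REG_EXTENDED and copies out one capture group. Several things here are assumptions rather than facts from the thread: the helper name `extract_group`, the shape of the sample log line in the usage note, and writing the trailing `\n?` as a real newline escape (a backslash followed by `n` has no defined meaning in POSIX extended syntax). The pattern is compiled on every call only to keep the example short; a real loop would compile it once.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define NGROUPS 12  /* whole match plus the 11 parenthesized groups */

/* The posted pattern joined into one C string. Assumption: the final
 * "\n?" is a real newline character followed by '?'. */
static const char *LOG_PATTERN =
    "^([[:alpha:]]{3} +[[:digit:]]{1,2} +"
    "[[:digit:]]{1,2}:[[:digit:]]{1,2}:[[:digit:]]{1,2}) +"
    "([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]+)\\] +"
    "([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\n?";

/* Hypothetical helper: copies capture group `idx` of `line` into `out`.
 * Returns 0 on success, -1 if the pattern fails to compile, the line
 * does not match, the group did not take part, or `out` is too small. */
int extract_group(const char *line, size_t idx, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[NGROUPS];
    int rc = -1;

    if (regcomp(&re, LOG_PATTERN, REG_EXTENDED) != 0)
        return -1;
    if (idx < NGROUPS && regexec(&re, line, NGROUPS, m, 0) == 0 &&
        m[idx].rm_so != -1) {
        size_t len = (size_t)(m[idx].rm_eo - m[idx].rm_so);
        if (len < outsz) {
            memcpy(out, line + m[idx].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

For a made-up line such as "Jan 16 14:57:03 host-12-daemon[345] module: hello world", group 1 is the timestamp, group 5 is the number in square brackets and group 11 is the text after the colon. Timing this single call in isolation, and then again with capture reporting disabled via REG_NOSUB, is one way to see whether the expression itself or something else in the program accounts for the slowdown.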

