Regexp optimization question


Problem description

I'm working on a project (Atox) where I need to match quite a few
regular expressions (several hundred) in reasonably large text files.
I've found that this can easily get rather slow. (There are many
things that slow Atox down -- it hasn't been designed for speed, and
any optimizations will entail quite a bit of refactoring.)

I''ve tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location. I haven't yet looked at the performance of the code
for checking this, because I encountered a problem before that: Using
named groups will only work for 100 patterns -- not a terrible
problem, since I can create several 100-group patterns -- and using
named groups slows down the matching *a lot*. As far as I could tell,
using named groups actually was slower than simply matching the
patterns one by one.

So: What can I do? Is there any way of getting more speed here, except
implementing the matching code (i.e. the code right around the calls
to _re) in C or Pyrex? (I've tried using Psyco, but that didn't help;
I guess it might help if I implemented things differently...)

Any ideas?

--
Magnus Lie Hetland "Oppression and harassment is a small price to pay
http://hetland.org to live in the land of the free." -- C. M. Burns

Recommended answer

Magnus Lie Hetland wrote:
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location.




Are you using the .lastindex attribute of match objects yet?

Martin
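
For what it's worth, here is a minimal sketch of how .lastindex could be used with a combined or-group. The helper names (build_combined, which_pattern) and the sample patterns are made up for illustration and are not Atox code. It assumes each pattern is wrapped in its own capturing group and that no pattern uses numbered backreferences, since wrapping shifts the group numbers. Because a pattern may contain groups of its own, the sketch keeps a table from wrapper group index to pattern position.

```python
import re

def build_combined(patterns):
    """Join patterns into one alternation and record which group wraps which pattern."""
    parts = []
    index_to_pattern = {}
    next_group = 1
    for position, pattern in enumerate(patterns):
        parts.append("(" + pattern + ")")
        index_to_pattern[next_group] = position
        # Skip the wrapper group plus any capturing groups inside the pattern.
        next_group += 1 + re.compile(pattern).groups
    return re.compile("|".join(parts)), index_to_pattern

# Hypothetical sample patterns; the second one contains a group of its own.
patterns = [r"\d+", r"(\+|-)", r"[A-Za-z_]\w*"]
combined, index_to_pattern = build_combined(patterns)

def which_pattern(text, pos=0):
    """Return (pattern position, matched text) for a match starting at pos, or None."""
    m = combined.match(text, pos)
    if m is None:
        return None
    # The wrapper group closes after any inner group, so lastindex names the wrapper.
    return index_to_pattern[m.lastindex], m.group()
```

Note that the alternation is tried left to right, so the first alternative that matches wins, not the longest one; the order of the patterns matters.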


Magnus Lie Hetland <ml*@furu.idi.ntnu.no> wrote:
Any ideas?






A few concrete examples, perhaps? Sadly, my telepathetic power is not
what it used to be...

--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution/training/migration, Thin-client





"Magnus Lie Hetland" <ml*@furu.idi.ntnu.no> wrote in message
news:slrnc8gal3.9da.ml*@furu.idi.ntnu.no...

Any ideas?






Maybe Plex is helpful. I have not used it yet, but it seems to address your
problem.
The author of Plex is Greg Ewing. He built Pyrex on top of Plex.

The documentation
http://www.cosc.canterbury.ac.nz/~gr...doc/index.html
contains

"""
Plex is designed to fill a need that is left wanting by the existing Python
regular expression modules. If you've ever tried to use one of them for
implementing a scanner, you will have found that they're not really suited
to the task. You can define a bunch of regular expressions which match your
tokens all right, but you can only match one of them at a time against your
input. To match all of them at once, you have to join them all together into
one big r.e., but then you've got no easy way to tell which one matched.
This is the problem that Plex is designed to solve.
"""

Hope I could help you
Guenter
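
For comparison with the scanner problem described in the Plex documentation quoted above, here is a rough sketch of a scanner built on the standard re module alone. It is not Plex and does not use the Plex API. The token names and patterns in TOKEN_SPEC are hypothetical. It relies on named groups and the match object's lastgroup attribute, which the original poster reports as slow for large pattern sets, so treat it as an illustration of the idea rather than a performance fix.

```python
import re

# Hypothetical token table: (name, pattern). Order matters, first match wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]

# One big alternation with a named group per token type.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield (token name, matched text) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        # lastgroup is the name of the group that matched.
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
        pos = m.end()
```

Calling list(tokenize("x = 42 + y")) walks the input once and reports which named alternative matched at each position.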

