Seeking regex optimizer


Problem description

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields an SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
..., 'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay

Recommended answer



Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields an SRE_Match
> object when s starts with an s_i for one i in [0,...,n]. There might
> be relations between those strings: s_k.startswith(s_1) -> True or
> s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
> ..., 'aaaa...ab']. For this reason SRE_Match should provide the longest
> possible match.
>
> Is there a Python module able to create an optimized regex rx from ls
> for the given constraints?
>
> Regards,
> Kay




A start would be:

regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"

But the above does not work if you have special characters in your
strings.

You say you want something that is optimised. What have you tried?

- Pad.
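
As a quick illustration of the suggestion above, here is a minimal sketch. The sample list and the helper name `naive_regex` are made up for this example and do not come from the thread. The reverse sort places every string ahead of its own prefixes, so the first alternative that matches is also the longest one; the last lines show how an unescaped metacharacter changes the meaning of the pattern.

```python
import re

def naive_regex(ls):
    # Reverse lexicographic order puts 'aab' before 'aa' and 'a', so the
    # first alternative that matches is also the longest one.
    return "^(" + "|".join(sorted(ls, reverse=True)) + ")"

ls = ["a", "aa", "aab", "ab", "x"]   # hypothetical sample data
rx = re.compile(naive_regex(ls))
longest = rx.match("aabc").group(0)

# Without the sort, the engine stops at the first alternative that fits.
unsorted_rx = re.compile("^(" + "|".join(ls) + ")")
first_fit = unsorted_rx.match("aabc").group(0)

# An unescaped '.' in a list entry behaves as a wildcard.
dotted = re.compile(naive_regex(["a.b"]))
false_hit = dotted.match("axb") is not None

print(longest, first_fit, false_hit)
```

With this sample data the sorted pattern reports `aab`, the unsorted one stops at `a`, and the unescaped `a.b` also accepts `axb`, which is not in the list.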


Paddy wrote:
> Kay Schluehr wrote:
>> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
>> regular expression sx from it, such that sx.match(s) yields an SRE_Match
>> object when s starts with an s_i for one i in [0,...,n]. There might
>> be relations between those strings: s_k.startswith(s_1) -> True or
>> s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
>> ..., 'aaaa...ab']. For this reason SRE_Match should provide the longest
>> possible match.
>>
>> Is there a Python module able to create an optimized regex rx from ls
>> for the given constraints?
>>
>> Regards,
>> Kay
>
> A start would be:
> regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
> But the above does not work if you have special characters in your
> strings.





For special characters there might be a workaround using escapes. This
is indeed important, but I do think one should split the problem into
separate parts.

> You say you want something that is optimised. What have you tried?






Sorting the list and checking for successor inclusions. Say you have
ls = ['x', 'a', 'aa', 'aab', 'ab'].

This can be mapped to:

'x|a(?:(?:ab)?|b?|a?)'

or to:

'^(x|ab|aab|aa|a)'

with reverse sorting as in your proposal. The naive solution is easy to
generate but I'm sceptical about its cost effectiveness. On the other
hand I do not want to investigate this matter if somebody else already
did it thoroughly.

Regards,
Kay
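
One way the "successor inclusion" idea could be automated is to insert the strings into a character trie and emit the pattern from it. The sketch below is only an illustration under that assumption; `trie_regex` and `_build` are hypothetical names, not an existing module. One caveat about hand-factoring: in Python's re, an optional branch such as `(?:ab)?` succeeds on the empty string, so later alternatives in the same group are never tried, and the hand-written factored pattern above stops at `a` for the input `ab`. Generating the pattern from a trie avoids this because sibling branches always start with different characters.

```python
import re

def trie_regex(words):
    """Build a prefix-factored pattern from a list of literal strings."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}   # the empty key marks the end of a listed string
    return _build(trie)

def _build(node):
    # One alternative per outgoing character; siblings start with different
    # characters, so at most one of them can match at a given position.
    alts = [re.escape(ch) + _build(node[ch]) for ch in sorted(node) if ch]
    if not alts:
        return ""
    body = "|".join(alts)
    if "" in node:
        # A listed string ends here: the continuation is optional, and the
        # greedy '?' tries the longer continuation first.
        return "(?:" + body + ")?"
    return "(?:" + body + ")" if len(alts) > 1 else body

pattern = trie_regex(["x", "a", "aa", "aab", "ab"])
rx = re.compile(pattern)
print(pattern)
print(rx.match("ab").group(0), rx.match("aac").group(0))
```

Whether such a factored pattern actually runs faster than the plain sorted alternation would have to be measured on real data.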


On 19/06/2006 6:30 AM, Paddy wrote:
> Kay Schluehr wrote:
>> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
>> regular expression sx from it, such that sx.match(s) yields an SRE_Match
>> object when s starts with an s_i for one i in [0,...,n]. There might
>> be relations between those strings: s_k.startswith(s_1) -> True or
>> s_k.endswith(s_1) -> True.

Kay, what is the relevance of one string being a suffix of another? I
don't see how that could affect the outcome.

>> An extreme case would be ls = ['a', 'aa', ..., 'aaaa...ab']. For this
>> reason SRE_Match should provide the longest possible match.
>>
>> Is there a Python module able to create an optimized regex rx from ls
>> for the given constraints?

Optimised with respect to what? Speed? Ease of construction?

I presume that you will ensure that the members of the list are unique.

Note that the Python regex engine will consider each candidate in
Paddy's solution left to right until it gets a match or reaches the end
(that's why the reverse sort is needed to get longest possible match).
This is worst-case O(N) where N is the total of the lengths of all the
strings in your list.

As far as I can see, this is the only basic solution (modulo one or two
twiddles -- see below) using Python's re.

You could possibly consider producing "zzz|foo(?:bar)?|aaa" instead of
"zzz|foobar|foo|aaa" -- but whether that would run sufficiently faster
to offset the cost of construction is anybody's guess.

How many strings in your list? Average/maximum length? Likelihood of
ls[i].startswith(ls[j]) == True? unicode or str?

Your requirements are rather constrained: "sx.match(s) yields a
SRE_Match object" ... why do you need this? Surely all you need is
matched_len (which may be zero) such that s[:matched_len] is the matched
prefix.

I would have thought the way to approach this would be a simple
character-by-character tree/trie-driven lookup. This would be worst case
O(n) where n is the length of the longest string in your list. There may
well be a Python-callable gadget for this on the net somewhere. Google
"Danny Yoo ahocorasick" for a Python-callable solution to a similar but
more complex problem.
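
The trie-driven lookup described above could look roughly like the following. This is a sketch using plain nested dicts, not the ahocorasick package mentioned in the post; `build_trie` and the function form of `matched_len` are names chosen for this illustration.

```python
def build_trie(words):
    """Nested-dict trie; the empty key marks the end of a listed string."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def matched_len(trie, s):
    """Length of the longest listed string that s starts with (0 if none)."""
    node = trie
    best = 0
    for i, ch in enumerate(s):
        node = node.get(ch)
        if node is None:
            break               # no listed string continues with this character
        if "" in node:
            best = i + 1        # a listed string ends exactly here
    return best

trie = build_trie(["a", "aa", "aab", "ab", "x"])   # hypothetical sample data
n = matched_len(trie, "aabc")
print("aabc"[:n])
```

The walk inspects at most as many characters as the longest listed string, which is the worst-case bound given above. A result of zero does not distinguish "no match" from a listed empty string; a caller that cares would need a separate flag.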

A cheap hack using Python's re: divide the problem according to first
character:

import re

# One compiled matcher per first character; within each pattern the
# longer alternative comes first ('foobar' before 'foo').
prefix_dict_match = {
    'a': re.compile('alpaca|alligator').match,
    'f': re.compile('foobar|foo').match,
    'z': re.compile('zoo|zebra').match,
}
if s and s[0] in prefix_dict_match:
    match_obj = prefix_dict_match[s[0]](s)
else:
    match_obj = None
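
For a longer list, the dispatch table above would presumably be generated rather than typed by hand. A minimal sketch, assuming the entries are literal strings; `build_prefix_dict_match` and `dispatch_match` are hypothetical helper names introduced for this example.

```python
import re
from collections import defaultdict

def build_prefix_dict_match(ls):
    """Group the strings by first character and compile one matcher per group."""
    groups = defaultdict(list)
    for word in ls:
        if word:                        # an empty string has no first character
            groups[word[0]].append(word)
    return {
        first: re.compile(
            "|".join(map(re.escape, sorted(words, reverse=True)))
        ).match
        for first, words in groups.items()
    }

def dispatch_match(table, s):
    # s[:1] is '' for an empty string, which is never a key in the table.
    matcher = table.get(s[:1])
    return matcher(s) if matcher else None

table = build_prefix_dict_match(
    ["alpaca", "alligator", "foobar", "foo", "zoo", "zebra"]
)
m = dispatch_match(table, "foobarbaz")
print(m.group(0) if m else None)
```

Each group is escaped and reverse-sorted so that a string is tried before any of its own prefixes.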

>> Regards,
>> Kay



> A start would be:
> regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
> But the above does not work if you have special characters in your
> strings.




Paddy, fixing that problem, and "optimising" by removing the redundant
^() metacharacters:

regexp = "|".join(map(re.escape, sorted(ls, reverse=True)))

Hoping some of this helps,
Regards,
John
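
To see the effect of the escaped one-liner, here is a small sketch. The sample strings and the wrapper name `literal_prefix_regex` are invented for this example. Because `match` only ever tries position 0, the leading `^` and the capturing group are not needed.

```python
import re

def literal_prefix_regex(ls):
    # re.escape neutralises metacharacters such as '.' and '+'; the reverse
    # sort keeps each string ahead of its own prefixes.
    return "|".join(map(re.escape, sorted(ls, reverse=True)))

rx = re.compile(literal_prefix_regex(["a.b", "a", "c++", "c"]))

print(rx.match("a.bc").group(0))       # the literal 'a.b' is matched
print(rx.match("axb").group(0))        # only 'a' matches; '.' is no longer a wildcard
print(rx.match("c++ rocks").group(0))  # '+' is treated literally
print(rx.match("xa"))                  # match() does not search past position 0
```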

