How to order regular expression alternatives to get longest match?

Question

I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN. I would like to reorder the component expressions so that the combined expression gives the longest possible match at the beginning of a given string.

I believe this means reordering the regular expressions such that "if regexK matches a prefix of regexL, then L < K". If this is correct, is it possible to find out, in general, whether regexK can match a prefix of regexL?

Recommended answer

Use the right regex flavor!

In some regex flavors, the alternative that gives the longest match is the one that is used ("greedy alternation"). Note that most of these flavors are old (yet still in use today) and thus lack some modern constructs such as backreferences.

Perl6 is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles: || creates an alternation that short-circuits to the first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.

Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See my "Relevant info about regex engine design" section for more information about DFAs.)

(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).
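To illustrate the anchoring advice, here is a minimal sketch using Python's re module. The pattern foo|foobar and the sample text are made up for this example; they are not from the question.

```python
import re

# Hypothetical alternation and input, chosen only for illustration.
pattern = r"(?:foo|foobar)"
text = "xx foobar"

# Unanchored: the engine scans forward and matches in the middle of the string.
unanchored = re.search(pattern, text)

# Anchored with ^: a match is only accepted at the very beginning of the string.
anchored = re.search(r"^" + pattern, text)

print(unanchored.group(0), unanchored.start())  # foo 3
print(anchored)                                 # None
```

Note that the unanchored search also returns foo rather than foobar, because Python's engine takes the first alternative that succeeds.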

According to this page*:

The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.


* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.
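For contrast with the POSIX behavior quoted above, here is a small sketch of the same Set|SetValue alternation in Python, whose re module is a backtracking engine and is not POSIX-style. It only shows why the order of alternatives matters in such an engine.

```python
import re

text = "SetValue"

# Python's re takes the first alternative that lets the overall match
# succeed, not the longest one.
first_wins = re.match(r"Set|SetValue", text).group(0)

# Putting the longer alternative first recovers the longest match.
reordered = re.match(r"SetValue|Set", text).group(0)

print(first_wins)  # Set
print(reordered)   # SetValue
```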

Let me give you one specific example that I have verified on my own computer:

echo "ab c a" | sed -E 's/(a|ab)/replacement/'

The regex is (a|ab). When it runs on the string ab c a, you get replacement c a, meaning that you do, in fact, get the longest match that the alternation can provide.

For a more complex example, the regex (a|ab.*c|.{0,2}c*d) applied to abcccd matches all of abcccd.

Try it here!

More clarification: once the regex engine can match something, it will not move forward in the search string to see if an even longer match starts later. It will only look through the current list of alternatives to see if another one matches a longer string (from the position where the initial match starts).

In other words, no matter the order of choices in an alternation, POSIX-compliant regex engines use the one that matches the most characters.
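If you are stuck with a first-match engine, the behavior described above can be approximated by hand: try every alternative at the same starting position and keep the longest result. The following Python sketch assumes the alternatives are available as separate pattern strings; longest_alternative is a hypothetical helper, not part of any library. It only compares what each alternative matches on its own, so it does not reproduce full POSIX semantics (for example, when more pattern follows the alternation, or when an alternative itself contains a first-match alternation).

```python
import re

def longest_alternative(alternatives, text, pos=0):
    """Return the longest match among the alternatives starting at pos, or None."""
    best = None
    for pattern in alternatives:
        m = re.compile(pattern).match(text, pos)  # anchored at pos
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

simple = longest_alternative(["a", "ab"], "ab c a")
complex_case = longest_alternative(["a", "ab.*c", ".{0,2}c*d"], "abcccd")

print(simple.group(0))        # ab
print(complex_case.group(0))  # abcccd
```

Because every alternative is tried, the order of the list does not affect the result.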

Flavors with this longest-match alternation behavior:

  • Tcl ARE
  • POSIX ERE
  • GNU BRE
  • GNU ERE

Relevant info about regex engine design

This question asks about designing an engine, but the answers may be helpful to understand how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page. It explains how alternatives can be combined into a single path.
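As a rough illustration of combining alternatives into a single path, the sketch below handles only literal alternatives (no operators) by merging them into a trie, which behaves like a small DFA. The function names are invented for this example, and a real DFA-based engine is considerably more general.

```python
def build_trie(words):
    """Merge literal alternatives into one trie; shared prefixes share a path."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # None marks an accepting state
    return root

def longest_prefix_match(trie, text):
    """Walk the trie once over text and remember the last accepting position."""
    node = trie
    last_accept = 0 if None in trie else None
    for i, ch in enumerate(text):
        if ch not in node:
            break
        node = node[ch]
        if None in node:
            last_accept = i + 1
    return None if last_accept is None else text[:last_accept]

trie = build_trie(["Set", "SetValue"])
print(longest_prefix_match(trie, "SetValue"))  # SetValue
print(longest_prefix_match(trie, "Settings"))  # Set
```

Since Set and SetValue share one path through the trie, the input is scanned only once and the order in which the alternatives were listed has no effect on the result.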

Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.
