查找匹配所有给定字符串的最简单的正则表达式 [英] Find simplest regular expression matching all given strings

查看:367
本文介绍了查找匹配所有给定字符串的最简单的正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否存在一种可以从一组字符串中生成正则表达式(可能仅限于简化语法)的算法,以便对所有与该正则表达式匹配的可能字符串进行求值,就可以得出字符串的初始集合?

Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?

找到这样一种非常复杂的语法(包括任意重复,断言等)的正则表达式语法算法可能是不现实的,所以让我们从一个简化的算法开始允许 OR 的子字符串:

It is probably unrealistic to find such a algorithm for grammars of regular expressions with very "complicated" syntax (including arbitrary repetitions, assertions etc.), so let's start with a simplified one which only allows for an OR of substrings:

foo(a | b | cd)bar 应该匹配 fooabar foobbar foocdbar

给出字符串集 h_q1_a h_q1_b h_q1_c h_p2_a h_p2_b h_p2_c ,算法的期望输出为 h_(q1 | p2) _(a | b | c)

Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).

给出字符串集 h_q1_a h_q1_b h_p2_a ,该算法的期望输出为 h_(q1_(a | b)| p2_a)请注意, h_(q1 | p2)_(a | b) 是不正确的,因为它会扩展为4个字符串,包括 h_p2_b ,它不在原始字符串集中。

Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct because that expand to 4 strings, including h_p2_b, which was not in the original set of strings.

我有一长串的标签,这些标签都是通过将子字符串放在一起而产生的。我不想打印大量的字符串,而是希望有一个紧凑的输出来指示列表中的标签。由于完整列表是通过编程生成的(使用有限的前缀和后缀集),因此我希望紧凑的表示法(可能)比初始列表要短得多。

I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.

((正则化)正则表达式应尽可能短,尽管我对实用的解决方案比对最佳的解决方案更感兴趣。所有字符串,例如A | B | C | D | ...都没有帮助。)

(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)

推荐答案

有如果要查找的是一组字符串的最小有限状态机(FSM),则可以直接解决此问题。由于生成的FSM不能具有循环(否则它将匹配无限数量的字符串),因此仅使用串联和分离( | )应该很容易转换为正则表达式操作员。尽管这可能不是最短的正则表达式,但是如果您使用的正则表达式库编译为最小化的DFA,则它将导致编译后的正则表达式最小。 (或者,您可以直接在类似Ragel的库中使用DFA。)

There is a straight-forward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)

如果您可以使用标准的FSM算法,则过程很简单:

The procedure is simple, if you have access to standard FSM algorithms:


  1. 通过将每个字符串作为状态序列添加,使非确定性有限状态自动机(NFA),每个序列从头开始州。显然,字符串的总大小为 O(N),因为原始字符串中的每个字符都会有一个NFA状态。

  1. Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.

从NFA构造确定性有限状态自动机(DFA)。 NFA是一棵树,甚至不是DAG,它应避免标准算法的指数最坏情况。实际上,您只是在这里构建前缀树,而您可以跳过步骤1并直接构建前缀树,直接将其转换为DFA。前缀树的节点数不能超过原始字符数(如果所有字符串以不同字符开头,则前缀树数可以相同),因此其输出为 O(N)的大小,但我无法证明它在时间上也 O(N)

Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.

最小化DFA。

DFA最小化是经过充分研究的问题。 Hopcroft算法是最坏情况的 O(NS log N)算法,其中 N 是DFA中的状态数,而 S 是DFA的大小字母。通常, S 将被视为常量;无论如何,Hopcroft算法的预期时间要好得多。

DFA minimization is a well-studied problem. The Hopcroft algorithm is worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.

对于非循环DFA,有线性时间算法;引用次数最多的是Dominique Revuz的作品,我发现了它的粗略描述此处为英文;原始论文似乎是付费的,但是 Revuz的论文(法语)可用。

For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.

这篇关于查找匹配所有给定字符串的最简单的正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆