如何枚举上下文无关语法的字符串? [英] How to enumerate the strings of a context-free grammar?

查看:100
本文介绍了如何枚举上下文无关语法的字符串?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

您使用什么算法枚举上下文无关文法生成的字符串?



在没有递归的情况下似乎可行,但我无法确定列出了可能包含所有(可能是间接的)递归的一般情况。



(我不是在寻找深奥的解决方案就像此页面上的那个一样;我'正在寻找可以映射到标准命令性代码的算法。)

解决方案

低效算法:

 构造R,语法的Earley解析器。 

对于A *中的每个字符串S(A是语法字母):
如果R识别出S:
输出S

在这里,我跳过了构造 R 的细节-例如,参见 Earley的论文,或更简洁地说,有关 Earley算法的Wikipedia文章。我还跳过了枚举A *中所有字符串的问题,这是一个简单的 | A | 计数器。



很显然,通过使用Earley解析器本身来避免(某些)死角,可以使该算法更加有效。而不是枚举 A * 中的所有字符串,我们从< string,state-set> 元组开始,初始化为由空字符串和空状态集组成的元组。然后,我们(无限地)从队列的开头删除一个元组,并将所有可能的元组添加到队列的末尾,这些元组可以通过将 A 中的一个符号输入到Earley解析器(通常,解析器将无法处理每个符号;实际上,它可能无法处理任何符号)。如果解析器识别出元组中的字符串,则将其输出。



在两种情况下,如果我们知道给定的语法属于CFG的一些更易于解析的子集,我们可以用Earley解析器代替语法更有效的解析器。 :首先按长度,在给定的长度内,按字典顺序排列,这样可以保证即使语法不明确,每个字符串也将生成一次。



另一种解决方法是构造字符串按所需减少的数量排序;实际上,这将生成所有(最左侧)的减少量。在这里,我们以开始符号开始一个队列,然后重复:

 删除队列中的第一句。 
如果仅包含终端,则将其输出。
否则,对于句子中第一个非终结符的每个生成,
会将扩展该生成的结果附加到队列中。

上述算法适用于明确的语法,但给定语法不明确,它将多次生成句子(一次按最左边的导数)。要解决该问题,我们首先可以将语法转换为 Chomsky范式(请参阅算法链接)。然后,我们为终端和非终端创建总排序,其中非终端都在终端之前,而句子的相应顺序是较短句子在较长句子之前,等长句子按字典顺序排序。然后,我们使用上述算法,但使用优先级队列而不是FIFO队列,并在处理重复条目之前消除它们。



在CNF中,长度为-n-字典顺序,所有的产生都在增加,因为它们要么用一个终结符代替一个非终结符,要么使句子加长一个符号。 (其余正确性的证明是通过归纳法得出的。)因此,完全衍生的句子将按照先长后词典的顺序进行枚举,就像开始该答案的朴素算法一样。


What algorithm do you use to enumerate the strings generated by a context-free grammar?

It seems doable when there is no recursion, but I can't figure out how to do it in the general case, which might contain all kinds of (possibly indirect) recursion.

(I'm not looking for an esoteric solution like the one on this page; I'm looking for an algorithm that I could map to standard imperative code.)

解决方案

Here's an obvious but somewhat inefficient algorithm:

Construct R, the Earley parser for the grammar.

For each string S in A* (A is the alphabet for the grammar):
  If R recognizes S:
    Output S

Here I skip the details of constructing R -- see, for example, Earley's thesis or, more concisely, the Wikipedia article on the Earley algorithm. I also skip the problem of enumerating all strings in A*, which is a simple base |A| counter.

Clearly, this algorithm can be made more efficient by using the Earley parser itself to avoid (some) dead-ends. Instead of enumerating all strings in A*, we start with a queue of <string, state-set> tuples, initialized to the tuple consisting of the empty string and the empty state set. We then (infinitely) remove one tuple from the head of the queue and add to the end of the queue all possible tuples which can be constructed by feeding one symbol from A into the Earley parser (typically, the parser will not be able to handle every symbol; in fact, it may not be able to handle any). If the parser recognizes the string in the tuple, we output it.

In both cases, if we know that the given grammar belongs to some more easily parseable subset of CFGs, we could substitute the Earley parser for a more efficient parser for the grammar.

Both of the above algorithms have the advantage of generating the strings of the language in a simple predictable order: first by length, and within a given length, lexicographically, which guarantees that each string will be generated exactly once even if the grammar is ambiguous.

Another solution is to construct the strings in order by the number of reductions required; in effect, this generates all (leftmost) reductions. Here we start a queue with the start symbol, and then repeatedly:

Remove the first sentence in the queue.
If it contains only terminals, output it.
Otherwise, for each production for the first non-terminal in the sentence,
  append to the queue the result of expanding that production.

The above algorithm will work fine for unambiguous grammars but given an ambiguous grammar, it will generate sentences multiple times (once per leftmost derivation). To fix that problem, we could first convert the grammar to Chomsky Normal Form (see link for algorithm). We then create a total ordering for terminals and non-terminals in which the non-terminals all precede the terminals, and a corresponding order for sentences in which a shorter sentence precedes a longer sentence, and equal-length sentences are ordered lexicographically. Then we use the above algorithm, but using a priority queue instead of a FIFO queue, and eliminate duplicate entries before processing them.

In CNF, with the length-then-lexicographic ordering, all productions are increasing, since they either replace a non-terminal with a terminal, or they make the sentence one symbol longer. (The rest of the proof of correctness is by induction.) Consequently, the fully-derived sentences will be enumerated in length-then-lexicographic order, just like the naive algorithm which started this answer.

这篇关于如何枚举上下文无关语法的字符串?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆