Efficient algorithm for converting a character set into an NFA/DFA

Question

I'm currently working on a scanner generator. The generator already works fine. But when using character classes the algorithm gets very slow.

The scanner generator produces a scanner for UTF8 encoded files. The full range of characters (0x000000 to 0x10ffff) should be supported.

If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10,000). So the conversion from NFA to DFA and the construction of the minimal DFA take a long time (even though the resulting minimal DFA contains only a few states).

Here's my current implementation for creating the character-set part of the NFA.

void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters) {
        // get the utf8 encoded bytes for the character
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;
        for (int i = 0; i < encoded.Length - 1; i++) {
            // reuse the existing transition for this byte, or create a new state
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1) {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }
        // the final byte of the sequence leads to the shared end state
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}

Does anyone know how to implement the function much more efficiently to create only the necessary states?

EDIT:

To be more specific, I need a function like:

List<Set<byte>[]> Convert(Set<int> characters)
{
     ???????
}
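
As an illustration only (not the asker's code), here is a hedged sketch of what such a function could look like if it simply groups code points whose UTF8 encodings share everything except the final byte; HashSet<byte> stands in for the Set<byte> type used above, and it relies on the EncodeCharacter helper defined below:

// needs: using System; using System.Collections.Generic; using System.Linq;
// Hypothetical sketch: group code points by the shared prefix of their UTF8
// encoding, so each group becomes one byte-sequence template (singleton sets
// for the prefix bytes, one set of possible final bytes). This reduces the
// number of NFA paths but does not perform full DFA minimization.
static List<HashSet<byte>[]> Convert(HashSet<int> characters)
{
    var lastBytes = new Dictionary<string, HashSet<byte>>(); // prefix key -> possible final bytes
    var prefixes = new Dictionary<string, byte[]>();         // prefix key -> prefix bytes

    foreach (int character in characters)
    {
        byte[] encoded = EncodeCharacter(character);
        byte[] prefix = encoded.Take(encoded.Length - 1).ToArray();
        string key = BitConverter.ToString(prefix);

        if (!lastBytes.TryGetValue(key, out var finals))
        {
            finals = new HashSet<byte>();
            lastBytes[key] = finals;
            prefixes[key] = prefix;
        }
        finals.Add(encoded[encoded.Length - 1]);
    }

    var result = new List<HashSet<byte>[]>();
    foreach (var pair in lastBytes)
    {
        byte[] prefix = prefixes[pair.Key];
        var sequence = new HashSet<byte>[prefix.Length + 1];
        for (int i = 0; i < prefix.Length; i++)
            sequence[i] = new HashSet<byte> { prefix[i] }; // fixed prefix byte
        sequence[prefix.Length] = pair.Value;              // varying final byte
        result.Add(sequence);
    }
    return result;
}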

A helper function to convert a character (int) to a UTF8-encoded byte[] is defined as:

byte[] EncodeCharacter(int character)
{ ... }
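
For completeness, a minimal sketch of such a helper (my own illustration, not the asker's implementation) could follow the standard UTF8 bit layout; it assumes the code point is already in the valid range 0x000000 to 0x10FFFF and does not reject surrogates:

// Hypothetical sketch of EncodeCharacter: standard UTF8 encoding of a code point.
// Assumes 0x000000 <= character <= 0x10FFFF; surrogate handling is omitted.
static byte[] EncodeCharacter(int character)
{
    if (character <= 0x7F)
        return new[] { (byte)character };                      // 1 byte: 0xxxxxxx
    if (character <= 0x7FF)
        return new[] {                                         // 2 bytes: 110xxxxx 10xxxxxx
            (byte)(0xC0 | (character >> 6)),
            (byte)(0x80 | (character & 0x3F))
        };
    if (character <= 0xFFFF)
        return new[] {                                         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            (byte)(0xE0 | (character >> 12)),
            (byte)(0x80 | ((character >> 6) & 0x3F)),
            (byte)(0x80 | (character & 0x3F))
        };
    return new[] {                                             // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        (byte)(0xF0 | (character >> 18)),
        (byte)(0x80 | ((character >> 12) & 0x3F)),
        (byte)(0x80 | ((character >> 6) & 0x3F)),
        (byte)(0x80 | (character & 0x3F))
    };
}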


Answer

I'll clarify what I think you're asking for: to union a set of Unicode codepoints such that you produce a state-minimal DFA where transitions represent UTF8-encoded sequences for those codepoints.

When you say "more efficiently", that could apply to runtime, memory usage, or to compactness of the end result. The usual meaning for "minimal" in finite automata refers to using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".

Every finite automaton has exactly one equivalent state-minimal DFA (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].

To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.

Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF8 sequence added in turn. Then that trie is annotated with transitions, turning it into a DAG, per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.

Pseudocode for this approach:

trie = empty
foreach codepoint in input_set:
   bytes[] = utf8_encode(codepoint)
   trie_add_key(trie, bytes)
dfa = add_failure_edges(trie) # per the rest of AC
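
As a concrete illustration of the trie step only (the failure-edge annotation from the paper is omitted), a hedged C# sketch reusing the asker's EncodeCharacter helper might look like this:

// needs: using System.Collections.Generic;
// Hypothetical sketch of the trie step: insert each UTF8-encoded byte
// sequence into a byte-keyed trie so that common prefixes are shared.
class TrieNode
{
    public readonly Dictionary<byte, TrieNode> Children = new Dictionary<byte, TrieNode>();
    public bool IsTerminal; // true if a complete encoded code point ends here
}

static TrieNode BuildTrie(IEnumerable<int> codepoints)
{
    var root = new TrieNode();
    foreach (int codepoint in codepoints)
    {
        TrieNode node = root;
        foreach (byte b in EncodeCharacter(codepoint)) // helper from the question
        {
            if (!node.Children.TryGetValue(b, out TrieNode child))
            {
                child = new TrieNode();
                node.Children[b] = child;
            }
            node = child;
        }
        node.IsTerminal = true; // mark the end of one encoded code point
    }
    return root;
}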

This approach (forming a trie of UTF8-encoded sequences, then Aho-Corasick, then rendering out a DFA) is the approach taken in my regexp and finite state machine libraries, where I do exactly this for constructing Unicode character classes. Here you can see code for:

Trie construction: libre/ac.c

Rendering out the minimal DFA for each character class: libre/class/

其他方法(如对该问题的其他答案所述)包括处理代码点和表达代码点范围,而不是拼出每个字节序列。

Other approaches (as mentioned in other answers to this question) include working on codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.
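
To illustrate the range-based idea (again a hedged sketch of my own, not code from any of the libraries mentioned): the first step is usually to split a code point range at the boundaries where the UTF8-encoded length changes, so that each sub-range maps to byte sequences of a single length; splitting each sub-range further into per-byte ranges is not shown here.

// needs: using System; using System.Collections.Generic;
// Hypothetical sketch: split [lo, hi] at the UTF8 length boundaries
// (1 byte up to 0x7F, 2 bytes up to 0x7FF, 3 bytes up to 0xFFFF, 4 bytes up to 0x10FFFF).
static IEnumerable<(int Lo, int Hi)> SplitByUtf8Length(int lo, int hi)
{
    int[] upperBounds = { 0x7F, 0x7FF, 0xFFFF, 0x10FFFF };
    foreach (int bound in upperBounds)
    {
        if (lo > hi) yield break;       // range exhausted
        if (lo <= bound)
        {
            int subHi = Math.Min(hi, bound);
            yield return (lo, subHi);   // all code points here encode to the same length
            lo = subHi + 1;
        }
    }
}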

[1] Myhill-Nerode: Nerode, Anil (1958), Linear Automaton Transformations, Proceedings of the AMS, 9, JSTOR 2033204
[2] Hopcroft & Ullman (1979), Section 3.4, Theorem 3.10, p.67
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM. 18 (6): 333–340.
