如何使用字符范围实现正则表达式NFA? [英] How to implement regular expression NFA with character ranges?

查看:81
本文介绍了如何使用字符范围实现正则表达式NFA?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当您阅读 Regex:NFA和汤普森的算法一切看起来都很简单,直到您在现实生活中意识到不仅需要直接字符"7"或"b",而且还需要:

When you read such posts as Regex: NFA and Thompson's algorithm everything looks rather straightforward until you realize in real life you need not only direct characters like "7" or "b", but also:

[A-Z]
[^_]
.

即字符类(或范围).因此,我的问题是-如何使用字符范围构建NFA?使用非A",其他"之类的元字符,然后计算重叠范围?使用最终自动机时,这将导致使用树状结构,而不仅仅是表格.

namely character classes (or ranges). And thus my question -- how to build NFA using character ranges? Using meta-characters like "not A", "anything else" and then computing overlapping ranges? This would lead to using tree-like structure when using final automaton, instead of just a table.

更新:请假定大小字母(>> 256)不平凡.

Update: please assume non-trivial in size (>>256) alphabet.

我询问的是NFA,但稍后我也想将NFA转换为DFA.

推荐答案

最简单的方法是:

  1. 使用段作为NFA和DFA中过渡的标签.例如,范围[a-z]将被表示为段[97, 122];单字符"a"将变为[97,97];以及任何字符."会变成[minCode, maxCode].

  1. Use segments as labels for transitions in both NFA and DFA. For example, range [a-z] would be reperesented as segment [97, 122]; single character 'a' would become [97,97]; and any character '.' would become [minCode, maxCode].

每个否定范围[^ a-z]将导致从开始状态到下一个状态的两次转换.在此示例中,应创建两个过渡[minCode, 96][123, maxCode].

Each negated range [^a-z] would result in two transitions from starting state to next state. In this example two transitions [minCode, 96] and [123, maxCode] should be created.

当通过枚举所有可能的字符[abcz]来表示范围时,应创建每个字符的过渡,或者将代码的第一组字符分成范围以优化过渡数量.因此[abcz]将变为[a-c]|z.因此是两个过渡而不是四个.

When range is represented by enumerating all possible characters [abcz], either transition per character should be created, or the code migh first group characters into ranges to optimize the number of transitions. So the [abcz] would become [a-c]|z. Thus two transitions instead of four.

对于NFA,这应该足够了.但是,当存在具有相交字符的过渡时,将NFA转换为DFA的经典电源集构造无效.范围. 要解决此问题,仅需要一个附加的概括步骤.一旦创建了所有输入符号的集合,在我们的示例中,它将是一组线段,则应将其转换为一组不相交的线段.这可以在时间 O(n * Log(n))中完成,其中n是使用优先级排队(PQ)的集合中的多个分段,其中分段按左组件排序.示例:

This should be enough for NFA. However the classical power set construction to transform NFA to DFA will not work when there are transitions with intersecting character ranges. To solve this issue only one additional generalization step is required. Once a set of all input symbols created, in our case it will be a set of segments, it should be transformed into a set of non-intersecting segments. This can be done in time O(n*Log(n)), where n is a number of segments in a set using priority equeue (PQ) in which segments are ordered by the left component. Example:

  Procedure DISJOIN:
   Input  <- [97, 99] [97, 100] [98, 108]
   Output -> [97, 97] [98, 99], [100, 100], [101, 108]

第2步.要从设置状态"创建新的过渡,应按以下方式修改算法:

Step 2. To create new transitions from a "set state" the algorithm should be modified as following:

   for each symbol in DISJOIN(input symbols)
     S <- empty set of symbols
     T <- empty "set state"
     for each state in "set state"
      for each transition in state.transitions
        I <- intersection(symbol, transition.label) 
        if (I is not empty)
        {
           Add I to the set S
           Add transition.To to the T
        }       

     for each segement from DISJOIN(S)
        Create transition from "set state" to T

要在搜索转换和输入符号C时加快匹配速度,可能按段对每个状态的转换进行排序,并应用二进制搜索.

To speed up matching when searching for a transition and input symbol C, transitions per state might be sorted by segments and binary search applied.

这篇关于如何使用字符范围实现正则表达式NFA?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆