如何使用有限自动机实施扫描仪 [英] How to use Finite Automaton to implement a scanner

查看:94
本文介绍了如何使用有限自动机实施扫描仪的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在构建一个简单的扫描仪。假设我为我的语言定义了以下标记:

  !,!=,!==,< ;,<< ;,{

现在我可以使用正则表达式来指定它们,所以:

 !=?=? | {| << ;? 

然后我使用了

解决方案

最常见的规则是最大嚼数,它总是选择可能的最长令牌。



通常,使用DFA扫描单个令牌的算法如下:(要对输入进行令牌化,请重复执行该算法,直到到达输入的末尾为止,每次扫描都从上次扫描留下的输入光标处开始。)



将DFA状态设置为开始状态。然后,对于按顺序输入的每个字符:




  • 如果DFA在该字符上有过渡,请移至指示状态。如果该状态是接受状态,则记录当前的输入光标和状态号。


  • 否则,将输入倒退到最后记录的接受位置并返回记录的状态号。




此处,接受状态号用于指示遇到哪个令牌。实际上,将每个接受状态与令牌代码相关联是很常见的,因为某些令牌类型将具有多个接受状态。



不一定总是使用上述的回溯算法。在某些情况下,对DFA的分析将表明,最后一个接受状态始终位于紧接在前的输入位置。但是,许多语言都需要回溯。



例如,使用的语言... 都是令牌,但不是。.(例如C)必须在输入 .. 1上回溯code>,应标记为 .1 。在此输入的标记化中,第一次扫描将接受第一个,然后继续第二个(即而不是接受状态),然后在输入 1 上找不到过渡。然后它将报告它找到了(记录的接受令牌),并将输入光标重置到第二个字符位置,以便下一次扫描将看到令牌 .1


I'm building a simple scanner. Suppose I have the following tokens defined for my language:

!, !=, !==, <, <<, {

Now I can specify them using regular expressions, so:

!=?=? | { | <<?

Then I used http://hackingoff.com to build NFA and DFA. Each machine now can determine if the input is in the language of regexp or not. But my program is a sequence of tokens, not one token:

!!=!<!==<<!{

My question is how I should use the machines to parse the string into tokens? I'm interested in the approach rather then implementation.

解决方案

The most common rule is "maximal munch", which always selects the longest possible token.

In general, the algorithm to scan a single token using a DFA is as follows: (To tokenise an input, this algorithm is repeated until the end of the input is reached, with each scan starting at the input cursor left by the previous scan.)

Set the DFA state to the start state. Then, for each input character in sequence:

  • if the DFA has a transition on the character, move to the iindicated state. If that state is an accepting state, record the current input cursor and the state number.

  • otherwise, rewind the input to the last recorded accept position and return the recorded state number.

Here, the accepting state number is used to indicate which token was encountered. In practice, it is common to associate each accepting state with a token code, because some token types will have more than one accepting state.

It is not always necessary to use the backtracking algorithm described above. In some cases, analysis of the DFA will reveal that the last accepting state was always at the immediately preceding input position. However, many languages require backtracking.

For example, a language in which . and ... are both tokens but not .. (such as C) must backtrack on input ..1, which should be tokenised as ., .1. In the tokenisation of this input, the first scan will accept the first ., continue with the second . (which is not an accepting state) and then fail to find a transition on the input 1. It will then report that it found a . (the recorded accepted token) and reset the input cursor to the second character position, so that the next scan will see the token .1.

这篇关于如何使用有限自动机实施扫描仪的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆