词法分析器的字面量提取策略 [英] literals extraction policy for a lexical Analyzer

查看:176
本文介绍了词法分析器的字面量提取策略的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经建立了一个C语言的词法分析器,例如给定这个输入产生以下结果。



输入 p>

  int i = 0; int j = i + 3; 

输出

  int KEYWORD 
i IDENTIFIER
= OPERATOR
; PUNCTUATION
int KEYWORD
j IDENTIFIER
= OPERATOR
i IDENTIFIER
+ OPERATOR
3 INTEGER_CONSTANT
; PUNCTUATION

在上面的例子中,你可能已经注意到给定的输入在语法上是正确的,

 <$ c 

$ c> int i =1.2.2222。+ \< ++++



创建了一个类,其唯一的目的是将上述字符串拆分为小部分(我称之为字面量,不知道是否是正确的字词),可以与正则表达式匹配或使用DFA验证。



问题出现在模糊的情况下,如+其中+可以是一个加法运算符,或者是即将到来的整数文字的一部分,甚至是一个增量运算符的一部分。



如果+之前有一个+,它应该被当作一个增量操作符。简单来说,程序必须尝试寻找每一种可能性, 。这意味着如果程序有一些有效的输入,那么一些无效的输入再次有一些有效的输入,它不应该停止在那个无效的输入,而是继续找到正确的文字。对我来说,我反对它。我的参数是如果一个程序字符串在某个索引无效,它应该停止处理,因为我们不是一个错误检查系统。



我试图编码所有可能性使用复杂(对我)嵌套if else结构并获得部分成功。你可以告诉我一个更简单和优雅的解决方案。我也想到了将这个问题构造成一个状态机,但我不太确定,因为我从来没有实现一个状态机之前的DFA,只是告诉是或否的模式匹配。



你可以看到它是一个家庭作业问题,但我不是只要求代码。

解决方案

通常的词法分析方法是使用maximal munch算法:输入流通过重复采用可能是单个令牌的最长前缀来划分为令牌。有关一种算法,请参见此答案



有时需要(在c ++中,例如< :: 通常被赋值为< :: ),但总的来说,最大的munch规则很容易实现,更重要的是,读取。


I have built a lexical analyzer for a C like language which for example given this input produces the following result.

Input

int i = 0 ; int j = i + 3;

Output

int    KEYWORD
i      IDENTIFIER
=      OPERATOR
;      PUNCTUATION
int    KEYWORD
j      IDENTIFIER
=      OPERATOR
i      IDENTIFIER
+      OPERATOR
3      INTEGER_CONSTANT
;      PUNCTUATION

In the above example you may have noticed the given input was syntactically correct, however when I give it something like below it fails.

Input

int i = "1.2.2222.+\<++++

I have made a class whose sole purpose is to break the above string into small parts (i call them literals , don't know if it is the correct term)that can be matched with regex or validated with DFA.

Problem arises with the ambiguous situations like + where + can either be an addition operator, or a part of an upcoming integer literal or even part of an increment operator. My teacher requirement is explained in the next paragraph.

if a + is preceded by a + it should be processed as an increment operator. In simple words the program must try to look for every possibility and choose the best. That means if the program has some valid input then some invalid input the again some valid input it should not stop at that invalid input instead keep finding the correct literals. For me though I am against it. My argument is if a program string becomes invalid at a certain index it should stop processing because we are not writing an error checking system after all.

I have tried to code all possibilities using a complex (for me) nested if else structure and gotten partial success. Can nay of you suggest me a simpler and elegant solution. I have also thought of structuring this problem into a state machine but I am not too sure because I have never implemented a state machine before other than the a DFA that can just tell yes or no for pattern matching.

As you can see it is a homework question but I am not asking for just code.

解决方案

Tbe usual approach to lexical analysis is to use the "maximal munch" algorithm: the input stream is divided into tokens by repeatedly taking the longest prefix which could be a single token. See this answer for one algorithm.

It is occasionally necessary to make exceptions to this rule (in c++, for example, <:: is normally lexed into <, ::) but on the whole, the maximal munch rule is easy to implement and, more importantly, to read.

这篇关于词法分析器的字面量提取策略的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆