Where should I draw the line between lexer and parser?

Question

I'm writing a lexer for the IMAP protocol for educational purposes and I'm stumped as to where I should draw the line between lexer and parser. Take this example of an IMAP server response:

* FLAGS (\Answered \Deleted)

This response is defined in the formal syntax like this:

mailbox-data   = "FLAGS" SP flag-list
flag-list      = "(" [flag *(SP flag)] ")"
flag           = "\Answered" / "\Deleted"

Since they are specified as string literals (aka "terminal" tokens) would it be more correct for the lexer to emit a unique token for each, like:

(TknAnsweredFlag)
(TknSpace)
(TknDeletedFlag)

Or would it be just as correct to emit something like this:

(TknBackSlash)
(TknString "Answered")
(TknSpace)
(TknBackSlash)
(TknString "Deleted")

My confusion is that the former method could overcomplicate the lexer: if \Answered had two meanings in two different contexts, the lexer wouldn't emit the right token. As a contrived example (this situation won't occur because e-mail addresses are enclosed in quotes), how would the lexer deal with an e-mail address like \Answered@googlemail.com? Or is the formal syntax designed to never allow such an ambiguity to arise?

Answer

As a general rule, you don't want lexical syntax to propagate into the grammar, because it's just detail. For instance, a lexer for a computer programming language like C would certainly recognize numbers, but it is generally inappropriate to produce HEXNUMBER and DECIMALNUMBER tokens, because this distinction isn't important to the grammar.
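
To make that rule concrete, here is a minimal sketch of mine (Python, purely illustrative; none of the names come from a real library): the lexer recognizes both hex and decimal spellings internally but hands the grammar a single NUMBER token carrying the value.

import re

# Hypothetical sketch: both spellings are recognized by the lexer,
# but the grammar only ever sees a NUMBER token with its value.
NUMBER_RE = re.compile(r'0[xX][0-9a-fA-F]+|[0-9]+')

def lex_number(text, pos=0):
    m = NUMBER_RE.match(text, pos)
    if m is None:
        return None
    lexeme = m.group(0)
    base = 16 if lexeme.lower().startswith('0x') else 10
    return ('NUMBER', int(lexeme, base)), m.end()

print(lex_number('0x1F'))   # (('NUMBER', 31), 4)
print(lex_number('31'))     # (('NUMBER', 31), 2)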

I think what you want are the most abstract tokens that still allow your grammar to distinguish the cases that matter for your purpose. You mediate this by weighing the confusion a choice causes in one part of the grammar against the options it opens up in other parts.

If your goal is simply to read past the flag values, then in fact you don't need to distinguish among them, and a TknFlag with no associated content would be good enough.

If your goal is to process the flag values individually, you need to know whether you got an ANSWERED and/or a DELETED indication. How they are lexically spelled is irrelevant, so I'd go with your TknAnsweredFlag solution. I would drop the TknSpace, because in any sequence of flags there must be intervening spaces (your spec says so), so I'd try to eliminate them using whatever whitespace-suppression machinery your lexer offers.
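
A minimal sketch of that approach, assuming a hand-written Python lexer (the Tkn... names come from the question; everything else is an illustrative assumption):

# Distinct token per flag; whitespace is suppressed inside the lexer
# instead of being surfaced as a TknSpace token.
FLAG_TOKENS = {
    '\\Answered': 'TknAnsweredFlag',
    '\\Deleted':  'TknDeletedFlag',
}

def lex_flags(flag_list_text):
    # str.split() with no argument collapses any run of whitespace,
    # which is the "whitespace suppression machinery" in this sketch.
    return [FLAG_TOKENS[word] for word in flag_list_text.split()]

print(lex_flags('\\Answered \\Deleted'))
# ['TknAnsweredFlag', 'TknDeletedFlag']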

On occasion, I run into situations where there are dozens of such flag-like things. Then your grammar starts to become cluttered if you have a token for each. If the grammar doesn't need to know the specific flags, then you should have a TknFlag with an associated string value. If a small subset of the flags is needed by the grammar to discriminate, but most of them are not, then you should compromise: have individual tokens for those flags that matter to the grammar, and a catch-all TknFlag with an associated string for the rest.
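
That compromise might look something like the following sketch (again illustrative; \Seen is just a stand-in for a flag the grammar doesn't need to distinguish):

# Dedicated tokens only for the flags the grammar must discriminate;
# everything else becomes a catch-all TknFlag carrying its spelling.
SPECIAL_FLAGS = {
    '\\Answered': 'TknAnsweredFlag',
    '\\Deleted':  'TknDeletedFlag',
}

def flag_token(lexeme):
    if lexeme in SPECIAL_FLAGS:
        return (SPECIAL_FLAGS[lexeme],)   # no payload needed
    return ('TknFlag', lexeme)            # payload keeps the spelling

print(flag_token('\\Answered'))   # ('TknAnsweredFlag',)
print(flag_token('\\Seen'))       # ('TknFlag', '\\Seen')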

Regarding the difficulty of having two different interpretations: this is one of those tradeoffs. If you have that issue, then your tokens need to carry fine enough detail in both places where they are needed in the grammar so that you can discriminate. If "\" is relevant as a token somewhere else in the grammar, you certainly could produce both TknBackSlash and TknAnswered. However, if the way something is treated in one part of the grammar differs from another, you can often get around this with a mode-driven lexer. Think of the modes as states of a finite state machine, each with an associated (sub)lexer. Transitions between modes are triggered by tokens that act as cues (you must have a FLAGS token; it is precisely such a cue that tells you you are about to pick up flag values). In one mode you can produce tokens that other modes would not produce; thus in one mode you might produce "\" tokens, but in your flag mode you wouldn't need to. Mode support is pretty common in lexers because this problem is more common than you might expect. See the Flex documentation for an example.
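
Here is a minimal Python sketch of the mode idea (Flex calls these "start conditions"); the mode names, token names, and rule set are illustrative assumptions, not a complete IMAP lexer:

import re

# Mode-driven sketch: the FLAGS cue switches to a 'flags' (sub)lexer
# with its own rules; ')' switches back to the default mode.
RULES = {
    'default': [
        (re.compile(r'\*'),    'TknUntagged'),
        (re.compile(r'FLAGS'), 'TknFlags'),           # cue: enter flag mode
    ],
    'flags': [
        (re.compile(r'\('),         'TknListOpen'),
        (re.compile(r'\\Answered'), 'TknAnsweredFlag'),
        (re.compile(r'\\Deleted'),  'TknDeletedFlag'),
        (re.compile(r'\)'),         'TknListClose'),  # cue: leave flag mode
    ],
}

def lex(text):
    mode, pos, tokens = 'default', 0, []
    while pos < len(text):
        if text[pos].isspace():          # whitespace suppressed in every mode
            pos += 1
            continue
        for pattern, name in RULES[mode]:
            m = pattern.match(text, pos)
            if m:
                tokens.append(name)
                pos = m.end()
                if name == 'TknFlags':
                    mode = 'flags'
                elif name == 'TknListClose':
                    mode = 'default'
                break
        else:
            raise SyntaxError(f'unexpected input at position {pos}')
    return tokens

print(lex('* FLAGS (\\Answered \\Deleted)'))
# ['TknUntagged', 'TknFlags', 'TknListOpen',
#  'TknAnsweredFlag', 'TknDeletedFlag', 'TknListClose']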

The fact that you are asking the question shows you are on the right track to making a good choice. You need to balance the maintainability goal of minimizing tokens (technically you could parse using a token for every ASCII character!) against the fundamental requirement to discriminate well enough for your needs. After you've built a dozen grammars this tradeoff will seem easy, but I think the rules of thumb I've provided are pretty good.
