Representing statement-terminating newlines in a grammar?

Question

A lot of programming languages have statements terminated by line-endings. Usually, though, line endings are allowed in the middle of a statement if the parser can't make sense of the line; for example,

a = 3 +
4

...will be parsed in Ruby and Python* as the statement a = 3+4, since a = 3+ doesn't make any sense. In other words, the newline is ignored since it leads to a parsing error.

My question is: how can I simply/elegantly accomplish that same behavior with a tokenizer and parser? I'm using Lemon as a parser generator, if it makes any difference (though I'm also tagging this question as yacc since I'm sure the solution applies equally to both programs).

Here's how I'm doing it now: allow a statement terminator to occur optionally in any case where there wouldn't be syntactic ambiguity. In other words, something like

expression ::= identifier PLUS identifier statement_terminator.
expression ::= identifier PLUS statement_terminator identifier statement_terminator.

... in other words, it's ok to use a newline after the plus because that won't have any effect on the ambiguity of the grammar. My worry is that this would balloon the size of the grammar and I have a lot of opportunities to miss cases or introduce subtle bugs in the grammar. Is there an easier way to do this?

EDIT*: Actually, that code example won't work for Python. Python does in fact ignore the newline if you pass in something like this, though:

print (1, 2,
3)

Recommended answer

You could probably make a parser generator get this right, but it would probably require modifying the parser generator's skeleton.

There are three plausible algorithms I know of; none is perfect.

  1. Insert an explicit statement terminator at the end of the line if:

     a. the previous token wasn't a statement terminator, and

     b. it would be possible to shift the statement terminator.

  2. Insert an explicit statement terminator prior to an unshiftable token (the "offending token", in Ecmascript speak) if:

     a. the offending token is at the beginning of a line, or is a } or is the end-of-input token, and

     b. shifting a statement terminator will not cause a reduction by the empty-statement production. [1]

  3. Make an inventory of all token pairs. For every token pair, decide whether it is appropriate to replace a line-end with a statement terminator. You might be able to compute this table by using one of the above algorithms.

Algorithm 3 is the easiest to implement, but the hardest to work out. And you may need to adjust the table every time you modify the grammar, which will considerably increase the difficulty of modifying the grammar. If you can compute the table of token pairs, then inserting statement terminators can be handled by the lexer. (If your grammar is an operator precedence grammar, then you can insert a statement terminator between any pair of tokens which do not have a precedence relationship. However, even then you may wish to make some adjustments for restricted contexts.)
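
To make the lexer-side approach of Algorithm 3 concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the token kinds (ID, NUM, PLUS, ...), the TERM and NEWLINE markers, and above all the contents of the pair table, which in a real language you would have to work out for your own grammar.

```python
# Hypothetical table: (previous token kind, next token kind) pairs across
# which a line end should become a statement terminator. The entries are
# illustrative only. Note that ("ID", "LPAREN") is deliberately absent,
# so "c\n(d + e)" stays a call; whether that is right is a design decision.
TERMINATE_BETWEEN = {
    ("ID", "ID"), ("NUM", "ID"), ("RPAREN", "ID"),
    ("ID", "EOF"), ("NUM", "EOF"), ("RPAREN", "EOF"),
}

def insert_terminators(tokens):
    """Replace each NEWLINE with TERM, or drop it, using the pair table."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != "NEWLINE":
            out.append(tok)
            continue
        prev = out[-1] if out else None
        # The next real token decides; runs of blank lines are skipped over.
        nxt = next((t for t in tokens[i + 1:] if t != "NEWLINE"), "EOF")
        if (prev, nxt) in TERMINATE_BETWEEN:
            out.append("TERM")
    return out

# "a = 3 +\n4\nb = 5\n"
stream = ["ID", "ASSIGN", "NUM", "PLUS", "NEWLINE", "NUM", "NEWLINE",
          "ID", "ASSIGN", "NUM", "NEWLINE"]
print(insert_terminators(stream))
```

The newline after PLUS is dropped because (PLUS, NUM) is not in the table, while the other two line ends become terminators. The parser never sees a newline token, so the grammar itself needs no optional-terminator productions.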

Algorithms 1 and 2 can be implemented in the parser if you can query the parser about the shiftability of a token without destroying the context. Recent versions of bison allow you to specify what they call "LAC" (LookAhead Correction), which involves doing just that. Conceptually, the parser stack is copied and the parser attempts to handle a token; if the token is eventually shifted, possibly after some number of reductions, without triggering an error production, then the token is part of the valid lookahead. I haven't looked at the implementation, but it's clear that it's not actually necessary to copy the stack to compute shiftability. Regardless, you'd have to reverse-engineer the facility into Lemon if you wanted to use it, which would be an interesting exercise, probably not too difficult. (You'd also need to modify the bison skeleton to do this, but it might be easier starting with the LAC implementation. LAC is currently only used by bison to generate better error messages, but it does involve testing shiftability of every token.)
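
As a rough illustration of the shiftability query, the following Python sketch implements Algorithm 1 on top of a hand-built LR table. It is not bison's LAC code and not anything Lemon provides; the toy grammar (stmt ::= E TERM, E ::= E PLUS ID | ID), the state numbers, and the function names are all assumptions made for this example. The trial parse works on a copy of the state stack, as described conceptually above.

```python
# Hand-built LR table for the toy grammar:
#   1: stmt ::= E TERM    2: E ::= E PLUS ID    3: E ::= ID
ACTION = {
    (0, "ID"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "TERM"): ("shift", 4), (2, "PLUS"): ("shift", 5),
    (3, "TERM"): ("reduce", 3), (3, "PLUS"): ("reduce", 3),
    (4, "$"): ("reduce", 1),
    (5, "ID"): ("shift", 6),
    (6, "TERM"): ("reduce", 2), (6, "PLUS"): ("reduce", 2),
}
GOTO = {(0, "stmt"): 1, (0, "E"): 2}
RULES = {1: ("stmt", 2), 2: ("E", 3), 3: ("E", 1)}  # rule -> (lhs, rhs length)

def step(stack, token):
    """Perform one parser action; return its kind, or None on a syntax error."""
    action = ACTION.get((stack[-1], token))
    if action is None:
        return None
    if action[0] == "shift":
        stack.append(action[1])
    elif action[0] == "reduce":
        lhs, length = RULES[action[1]]
        del stack[-length:]
        stack.append(GOTO[(stack[-1], lhs)])
    return action[0]

def feed(stack, token, trial=False):
    """Run reductions until `token` is shifted or accepted; True on success."""
    if trial:
        stack = list(stack)  # query only: leave the real stack untouched
    while True:
        kind = step(stack, token)
        if kind is None:
            return False
        if kind != "reduce":
            return True

def resolve_newlines(tokens):
    """Algorithm 1: a NEWLINE becomes TERM only where TERM could be shifted."""
    stack, out, prev = [0], [], None
    for tok in tokens:
        if tok == "NEWLINE":
            if prev == "TERM" or not feed(stack, "TERM", trial=True):
                continue  # ignore the line end
            tok = "TERM"
        if not feed(stack, tok):
            raise SyntaxError(f"unexpected {tok}")
        out.append(tok)
        prev = tok
    if not feed(stack, "$"):
        raise SyntaxError("unexpected end of input")
    return out

# "a +\nb\n": the first newline is ignored, the second ends the statement.
print(resolve_newlines(["ID", "PLUS", "NEWLINE", "ID", "NEWLINE"]))
```

Copying the stack keeps the sketch short. A generated parser could instead track only the states it would pop and push during the trial, which is the kind of optimization the paragraph above alludes to.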

One thing to watch out for, in all of the above algorithms, is statements which may start with parenthesized expressions. Ecmascript, in particular, gets this wrong (IMHO). The Ecmascript example, straight out of the report:

a = b + c
(d + e).print()

Ecmascript will parse this as a single statement, because c(d + e) is a syntactically valid function call. Consequently, ( is not an offending token, because it can be shifted. It's pretty unlikely that the programmer intended that, though, and no error will be produced until the code is executed, if it is executed.

Note that Algorithm 1 would have inserted a statement terminator at the end of the first line, but similarly would not flag the ambiguity. That's more likely to be what the programmer intended, but the unflagged ambiguity is still annoying.

Lua 5.1 would treat the above example as an error, because it does not allow a newline between the function object and the ( in a call expression. However, Lua 5.2 behaves like Ecmascript.
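
If you would rather flag this case than silently pick one interpretation, a check in the spirit of Lua 5.1 can run over the token stream before parsing. The Python sketch below is only an illustration under assumed names: the token kinds, the (kind, line) pair representation, and the ENDS_EXPRESSION set are hypothetical and would need to match your own lexer.

```python
# Token kinds that can end an expression (an assumption for this toy token set).
ENDS_EXPRESSION = {"ID", "NUM", "RPAREN"}

def ambiguous_call_lines(tokens):
    """Return line numbers that begin with "(" right after an expression.

    `tokens` is a list of (kind, line) pairs with no newline tokens; the
    line numbers alone show where the line breaks were.
    """
    flagged = []
    for (prev_kind, prev_line), (kind, line) in zip(tokens, tokens[1:]):
        starts_line = line > prev_line
        if kind == "LPAREN" and starts_line and prev_kind in ENDS_EXPRESSION:
            flagged.append(line)
    return flagged

# a = b + c
# (d + e).print()
example = [
    ("ID", 1), ("ASSIGN", 1), ("ID", 1), ("PLUS", 1), ("ID", 1),
    ("LPAREN", 2), ("ID", 2), ("PLUS", 2), ("ID", 2), ("RPAREN", 2),
    ("DOT", 2), ("ID", 2), ("LPAREN", 2), ("RPAREN", 2),
]
print(ambiguous_call_lines(example))
```

Only the ( that opens line 2 is reported; the ( in print() follows a token on the same line and is left alone. Whether a flagged line becomes an error or a warning is a language design choice.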

Another classical ambiguity is return (and possibly other statements) which have an optional expression. In Ecmascript, return <expr> is a restricted production; a newline is not permitted between the keyword and the expression, so a return at the end of a line has a semicolon automatically inserted. In Lua, it's not ambiguous because a return statement cannot be followed by another statement.

Notes:

  1. Ecmascript also requires that the statement terminator token be parsed as a statement terminator, although it doesn't quite say that; it does not allow the semicolons in the iterator clause of a for statement to be inserted automatically. Its algorithm also includes mandatory semicolon insertion in two contexts: after a return/throw/continue/break token which appears at the end of a line, and before a ++/-- token which appears at the beginning of a line.
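
The two mandatory-insertion contexts can be handled in a lexer pass that runs before any of the algorithms above. The Python sketch below follows the rule as summarized in this note, not the full Ecmascript specification; the token representation (plain strings, with "\n" marking a line end) and the function name are assumptions for illustration.

```python
# Keywords after which a line end always terminates the statement.
NO_NEWLINE_AFTER = {"return", "throw", "continue", "break"}
# Operators that always start a new statement when they begin a line.
NO_NEWLINE_BEFORE = {"++", "--"}

def force_terminators(tokens):
    """Turn a line end into ";" in the two mandatory contexts only.

    All other "\n" tokens pass through unchanged, so a later stage can
    still decide whether to ignore them or treat them as terminators.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok != "\n":
            out.append(tok)
            continue
        prev = out[-1] if out else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if prev in NO_NEWLINE_AFTER or nxt in NO_NEWLINE_BEFORE:
            out.append(";")
        else:
            out.append("\n")
    return out

print(force_terminators(["return", "\n", "a", "+", "b"]))
print(force_terminators(["a", "\n", "++", "b"]))
```

With this pass in place, "return" at the end of a line never absorbs the expression on the next line, which is the restricted-production behavior described above.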
