在python PLY(lex/yacc)中使用空生产规则的语法错误 [英] Syntax error using empty production rule in python PLY(lex/yacc)

查看:138
本文介绍了在python PLY(lex/yacc)中使用空生产规则的语法错误的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

完整示例如下:

import ply.lex as lex
import Property
# List of token names.   This is always required
tokens = [
    'CheckupInformation',
    'Introduction',
    'Information',
    'perfect',
    'sick',
    'LPAREN',
    'RPAREN',
    'CHAR',
    'NUMBER'
    ] 
def t_CheckupInformation(t)     : 'CheckupInformation'     ; return t
def t_Introduction(t)  : 'Introduction'  ; return t
def t_Information(t) : 'Information' ; return t
def t_perfect(t): 'perfect'; return t
def t_sick(t) : 'sick'; return t



t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_CHAR = r'[a-zA-Z_][a-zA-Z0-9_\-]*'
t_ignore = " \t"
# Define a rule so we can track line numbers

def t_NUMBER(t):
    r'[+\-0-9_][0-9_]*'
    t.lexer.lineno += len(t.value)
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large %s" % t.value)
        t.value = 0
    return t
def t_SEMICOLON(t):
    r'\;.*'
    t.lexer.lineno += len(t.value)
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

 # Build the lexer
lexer = lex.lex()
# define upper level classes first     
class stat:
    def __init__(self):
        self.statement = ""
        self.intro = list()
        self.body = list()


P=stat()
def p_stat(p):
    'Stat : LPAREN CheckupInformation statIntro statBody RPAREN'
    p[0]=(p[1],p[2],p[3],p[4],p[5])

def p_Intro(p) : 
    '''statIntro : LPAREN Introduction Name RPAREN
                 | statIntro LPAREN Introduction Name RPAREN
                 | empty'''

    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    else:
       p[0]= None
    P.intro.append(p[0])

def p_Name(p):
    'Name : CHAR'
    p[0]=p[1]



def p_Body(p):
    '''statBody : LPAREN Information bodyinfo RPAREN
                | statBody LPAREN Information bodyinfo RPAREN'''
    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    P.body.append(p[0])
def p_bodyinfo(p):
    '''bodyinfo : LPAREN CHAR perfect RPAREN
                | LPAREN CHAR sick RPAREN'''
    p[0]=p[2],p[3]


def p_empty(p):
    'empty :  '
    print("This function is called")
    pass   
def p_error(p):
    print("Syntax error in input '%s'!" % p.value)

import ply.yacc as yacc
parser = yacc.yacc()
import sys
if len(sys.argv) < 2 :
    sys.exit("Usage: %s <filename>" % sys.argv[0])
fp = open(sys.argv[1])
contents=fp.read()
result=parser.parse(contents)

print("(CheckupInformation")
if (P.intro) != None:
    for x in range(len(P.intro)):
        print("    (Introduction %s)" %(P.intro[x]))
for x in range(len(P.body)):
        print("    (Information( %s %s))" %(P.body[x]))
print(")")

文本File1为:

(CheckupInformation
  (Introduction John)
  (Introduction Patt)
  (Information(Anonymous1 perfect))
  (Information(Anonymous2 sick))
)

文本File2为:

(CheckupInformation
  (Information(Anonymous1 perfect))
  (Information(Anonymous2 sick))
)

根据我的bnf语法,简介"是可选的.带有文本文件1的上部代码运行良好.但是,当我从文本文件中将"Intro"部分作为文本file2删除时,它在"body"部分给出了语法错误,这意味着它无法处理空的生产.请帮助我如何解决该错误.如何处理我的代码的空生产规则?

According to my bnf grammar 'Intro' is optional. The upper code with the text file1 works well. But when I remove the 'Intro' part from text file as text file2, it gives syntax error at 'body' section that means it cannot handle empty production. Please help me how to solve the error. How to handle the empty production rule for my code?

错误消息:错误消息摘要

推荐答案

您的程序无法运行,因为您 import Property (它不是标准库模块).但是删除该行至少足以达到Ply试图构建解析器的地步,此时它会生成一些警告,包括移位/减少冲突警告.最后的警告很重要.您应该尝试修复它,并且您当然不应忽略它.(这意味着您应该将其报告为问题的一部分.)正是这种冲突使解析器无法正常工作.

Your program cannot be run because you import Property, which is not a standard library module. But deleting that line is at least sufficient to get to the point where Ply attempts to build the parser, at which point it generates several warnings, including a shift/reduce conflict warning. This last warning is important; you should attempt to fix it, and you certainly should not ignore it. (Which means that you should report it as part of your question.) It is this conflict which is preventing your parser from working.

这是它的意思:

WARNING: Token 'NUMBER' defined, but not used
WARNING: There is 1 unused token
Generating LALR tables
WARNING: 1 shift/reduce conflict

Ply生成文件 parser.out ,其中包含有关您的语法的更多信息,包括对移位/减少冲突的详细描述.检查该文件,我们发现以下内容:

Ply generates the file parser.out, which includes more information about your grammar, including a detailed description of the shift/reduce conflict. Examining that file, we find the following:

state 3

    (1) Stat -> LPAREN CheckupInformation . statIntro statBody RPAREN
    (2) statIntro -> . LPAREN Introduction Name RPAREN
    (3) statIntro -> . statIntro LPAREN Introduction Name RPAREN
    (4) statIntro -> . empty
    (10) empty -> .

  ! shift/reduce conflict for LPAREN resolved as shift
    LPAREN          shift and go to state 4

  ! LPAREN          [ reduce using rule 10 (empty -> .) ]

    statIntro                      shift and go to state 5
    empty                          shift and go to state 6

解析器此时在处理 Stat 时进入状态3:

The parser enters State 3 when it is at this point in the processing of Stat:

Stat -> LPAREN CheckupInformation . statIntro statBody RPAREN

CheckupInformation statIntro 之间的点表示进度.在中间带有点的状态下,可能会有不止一个产品,这意味着解析器还没有找出要选择的那些选择中的哪一个.也有以点开头的作品;这些将对应于紧接点之后的非末端,表明现在需要考虑这些产生.

The dot between CheckupInformation and statIntro indicates the progress. There might be more than one production in a state with dots in the middle, which means that the parser has not yet had to figure out which of those alternatives to pick. There are also productions with the dot at the beginning; these will correspond to the non-terminal(s) which immediately follow the dot(s), indicating that those productions now need to be considered.

也可能在结果的末尾加点,这表示在解析的这一点上,遇到的符号序列可以简化"为相应的非终结符.换句话说,解析器已经识别出该非终结符.

There may also productions with the dot at the end, which indicates that at this point in the parse, the sequence of symbols encountered can be "reduced" to the corresponding non-terminal. In other words, the parser has recognised that non-terminal.

必须在被识别时执行还原操作,或者根本不执行.如果以下令牌(超前令牌")无法跟随要还原的非终结符,则可能无法执行还原.通常,解析器需要考虑以下问题,可以通过查询状态转换表(在生产之后立即显示这些问题)来立即回答这些问题:

Reductions must be performed when they are recognised, or not at all. A reduction might not be performed, if the following token -- the "lookahead token" -- cannot follow the non-terminal to be reduced. In general, the parser needs to consider the following questions, which can be immediately answered by consulting the state transition table (these are shown immediately following the productions):

  1. 可以通过继续执行下一个标记而不进行缩减来继续进行解析吗?(这被称为转移"动作,因为解析器将一个标记在活动生产中向右移.)
  2. 对于此状态下的每个可能的缩减,执行该缩减都可以解析进行吗?

如果对多个问题中的一个以上的回答为是",则会发生冲突.这并不一定意味着语法是模棱两可的,但确实意味着解析器无法决定如何在这两种选择之间进行选择.

A conflict occurs if the answer to more than one of these questions is "yes". That doesn't necessarily mean that the grammar is ambiguous, but it does mean that the parser cannot decide how to choose between the two alternatives.

像大多数解析器生成器一样,Ply 使用一些内置规则解决了这个问题.这些规则之一是,在没有其他信息(优先声明)的情况下,如果第一个问题的答案为是",则解析器应继续进行而不进行任何简化.

Like most parser generators, Ply resolves this question using some built-in rules. One of these rules is that in the absence of other information (precedence declarations), if the answer to the first question was "yes", the parser should proceed without performing any reductions.

在此状态的特定示例中,可以减少的是 empty:.尽管从该状态尚不明显(我们必须查看解析器在进行这种还原后进入的状态,即状态6),但在减小 empty 之后,解析器的下一步行动必然是减少 statIntro ->空,然后进入状态5,其中包括生产

In the particular example of this state, the reduction which could be made is empty :. Although it's not obvious from this state (we'd have to look at the state the parser enters after doing that reduction, which is state 6), after reducing empty, the parser's next move will necessarily be to reduce statIntro -> empty, after which it will go to State 5, which includes the production

Stat -> LPAREN CheckupInformation statIntro . statBody RPAREN

为了使该序列有效,解析器需要知道它将能够继续进行,这意味着提前标记(在这种情况下为())必须是状态5.当然,这是因为 statBody 可以以一个开放的括号开头,因此可以采取减少的方法.

In order for that sequence to be valid, the parser needs to know that it will be able to progress, which means that the lookahead token (in this case () must be a possible input in State 5. Of course, it is because statBody can start with an open parenthesis. So the reduction could be taken.

但是 statIntro 也可以以 ( 开头,因此解析器不必为了前进而进行归约.鉴于这两种选择,Ply 选择采用shift 操作,这意味着它放弃 statIntro 可能为空的可能性,并假定 ( 属于 statIntro.如果存在 statIntro ,这是正确的选择.但是,如果缺少 statIntro ,则(属于statBody ,并且应该采取减少措施.

But statIntro could also begin with a (, so the parser does not have to do the reduction in order to progress. Given those two alternatives, Ply chooses to take the shift action, which means that it discards the possibility that statIntro could be empty and assumes that the ( belongs to a statIntro. If there is a statIntro, this is the correct choice. But if statIntro was missing, the ( belongs to statBody, and the reduction should have been taken.

这就是您的语法问题.这表明所写的语法需要多个预读标记.不幸的是,包括Ply在内的许多解析器生成器都没有一种机制来应付需要多个前瞻标记的语法.(如果所需的前瞻性数量有一定限制-例如,在这种情况下,可以通过查看接下来的两个标记来解决冲突-那么从理论上讲,可以为相同的语言找到等效的语法只需要一个先行令牌.但这将是您的责任,因为Ply不会为您这样做.)

So that's the problem with your grammar. It's an indication that the grammar, as written, needs more than one token of lookahead. Unfortunately, many parser generators, including Ply, do not have a mechanism to cope with grammars which need more than one lookahead token. (If there is some limit to the amount of lookahead needed -- in this case, for example, the conflict could be resolved by looking at the next two tokens -- then it is theoretically possible to find an equivalent grammar for the same language which needs only one lookahead token. But that will have to be your responsibility, because Ply won't do it for you.)

在这种情况下,解决方案非常简单.只需要从 statIntro 中删除空的生产,而是通过为 Stat 提供两个生产,使其具有可选性,而其中一个具有 statIntro 而另一种则没有:

In this case, the solution is extremely simple. It is only necessary to remove the empty production from statIntro, and instead make it optional by providing two productions for Stat, one which has a statIntro and one which doesn't:

def p_stat_1(p):
    'Stat : LPAREN CheckupInformation statIntro statBody RPAREN'
    p[0]=(p[1],p[2],p[3],p[4],p[5])

def p_stat_2(p):
    'Stat : LPAREN CheckupInformation           statBody RPAREN'
    p[0]=(p[1],p[2],None,p[3],p[4])

def p_Intro(p) :
    '''statIntro : LPAREN Introduction Name RPAREN
                 | statIntro LPAREN Introduction Name RPAREN
    '''

(我还从语法中删除了 p_empty .)

(I also removed p_empty from the grammar.)

此修改后的语法不会产生任何冲突,并且将按预期解析您的测试输入:

This modified grammar does not produce any conflicts, and will parse your test inputs as expected:

$ python3 learner.py learner.1
(CheckupInformation
    (Introduction John)
    (Introduction Patt)
    (Information( Anonymous1 perfect))
    (Information( Anonymous2 sick))
)
$ python3 learner.py learner.2
(CheckupInformation
    (Information( Anonymous1 perfect))
    (Information( Anonymous2 sick))
)

后记:

上面建议的转换很简单,适用于很多情况,而不仅仅是可以通过两个标记的先行来解决冲突的情况.但是,正如评论中指出的那样,它的确增加了语法的大小,尤其是当产品具有多个可选组件时.例如生产:

The transformation suggested above is simple and will work in a large number of cases, not just cases where the conflict can be resolved with a lookahead of two tokens. But, as noted in a comment, it does increase the size of a grammar, particularly when productions have more than one optional component. For example, the production:

A : optional_B optional_C optional_D

将必须扩展为七个不同的产品,一个用于 BCD 的每个非空选择,另外还需要在每个使用 A 的地方复制以允许 A 为空的情况.

would have to be expanded into seven different productions, one for each non-empty selection from B C D, and in addition each place where A was used would need to be duplicated to allow for the case where the A was empty.

这似乎很多,但可能并非所有这些作品都是必需的.仅当可以启动可选组件的端子组和可以跟在其后的符号组之间存在重叠时,才需要进行转换.因此,例如,如果 B C D 都可以以括号开头,但不能遵循 A 用括号括起来,那么 optional_D 不会引起冲突,只需要扩展 B C :

That seems like a lot, but it might be that not all of these productions are necessary. The transformation is only needed if there is an overlap between the set of terminals which can start the optional component and the set of symbols which can follow it. So, for example, if B, C and D can all start with a parenthesis but A cannot be followed by a parenthesis, then optional_D will not cause a conflict, and only B and C would need to be expanded:

A : B C optional_D
  |   C optional_D
  | B   optional_D
  |     optional_D

这需要进行一些语法分析,以找出可以遵循 A 的内容,但是在常见的语法中,手工操作并不难.

That requires a bit of grammatical analysis to figure out what can follow A, but in common grammars that's not too hard to do by hand.

如果这仍然看起来太多了,那么还有其他几种可能性不那么普遍,但无论如何还是有帮助的.

If that still seems like too many productions, there are a couple of other possibilities which are less general, but which might help anyway.

首先,您可能会决定在上面的作品中显示 B C D 的顺序并不重要.在这种情况下,您可以替换

First, you might decide that it doesn't really matter what order B, C and D are presented in the above production. In that case, you could replace

A : optional_B optional_C optional_D

不太复杂,但更易于接受的替代方法:

with the not-too-complicated, somewhat more accepting alternative:

A : 
  | A X
X : B | C | D

(或者您可以通过在 A 的生​​产版本中逐个写出替代方案来避免 X .)

(or you could avoid X by writing out the alternatives individually in productions of A.)

这允许多次使用 B C D ,但是这似乎与您的语法相对应,其中可选组件是实际上可能是空的重复.

That allows multiple uses of B, C and D, but that appears to correspond to your grammar, in which the optional components are actually possibly empty repetitions.

留下了如何生成合理的AST的问题,但是至少在Ply的情况下,这很容易解决.这是一种可行的解决方案,再次假设重复是可以接受的:

That leaves the problem of how to produce a reasonable AST, but that's fairly easy to solve, at least in the context of Ply. Here's one possible practical solution, again assuming that repetition is acceptable:

# This solution assumes that A cannot be followed by
# any token which might appear at the start of a component
def p_A(p):
    """ A : A1 """
    # Create the list of lists B, C, D, as in the original
    p[0] = [ p[1]["B"], p[1]["C"], p[1]["D"] ]

def p_A1_empty(p):
    """ A1 : """
    # Start with a dictionary of empty lists
    p[0] = { "A":[], "B":[], "C":[] }

def p_A1_B(p):
    """ A1 : A1 B """
    p[1]["B"].append(p[2])
    p[0] = p[1]

def p_A1_C(p):
    """ A1 : A1 C """
    p[1]["C"].append(p[2])
    p[0] = p[1]

def p_A1_D(p):
    """ A1 : A1 D """
    p[1]["D"].append(p[2])
    p[0] = p[1]

如果您安排了 B C D 的语义值以包括以下内容,则可以简化最后三个动作函数:它们是什么.因此,例如,如果 B 返回了 ["B",value] 而不是 value ,则可以将后三个组合在一起A1 动作合并为一个函数:

You could simplify the last three action functions if you arranged for the semantic values of B, C and D to include an indication of what they are. So if, for example, B returned ["B", value] instead of value, then you could combine the last three A1 actions into a single function:

def p_A1_BCD(p):
    """ A1 : A1 B
           | A1 C
           | A1 D
    """
    p[1][p[2][0]].append(p[2][1])
    p[0] = p[1]

如果这些都不令人满意,并且可以使用一个额外的超前标记解决所有冲突,那么您可以尝试在词法分析器中解决该问题.

If none of that is satisfactory, and all of the conflicts can be resolved with one additional lookahead token, then you can try to solve the issue in the lexer.

例如,您的语言似乎完全由S表达式组成,这些S表达式以开括号开头,后跟某种关键字.因此,词法分析器可以将开括号和以下关键字组合到单个标记中.一旦这样做,您的可选组件便不再以相同的令牌开头,并且冲突也就消失了.(此技术通常用于解析具有相同问题的XML输入:如果所有有趣的事情都以< 开头,则冲突很多.但是如果您识别出< tag 作为单个标记,然后冲突就消失了.对于XML,在< 和标记名之间不允许有空格;如果您的语法允许在(和以下关键字,您的词法分析器模式将稍微复杂一些,但仍可管理.)

For example, your language seems to entirely consist of S-expressions which start with a open parenthesis followed by some kind of keyword. So the lexical analyser could combine the open parenthesis with the following keyword into a single token. Once you do that, your optional components no longer start with the same token and the conflict vanishes. (This technique is often used to parse XML inputs, which have the same issue: if everything interesting starts with a <, then conflicts abound. But if you recognise <tag as a single token, then the conflicts disappear. In the case of XML, whitespace is not allowed between the < and the tagname; if your grammar does allow whitespace between the ( and the following keyword, your lexer patterns will become slightly more complicated. But it's still manageable.)

这篇关于在python PLY(lex/yacc)中使用空生产规则的语法错误的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆