Parsing tokens with PLY
Question
I've been trying to parse some given text with PLY for a while and I haven't been able to figure it out. I have these tokens defined:
tokens = ['ID', 'INT', 'ASSIGNMENT']
And I want to classify the words I find into these tokens. For example, if the scanner is given:
var = 5
it should print the following:
ID : 'var'
ASSIGNMENT : '='
INT : 5
This works just fine. The problem is when the program is given the following text:
9var = 5
The output is:
INT : 9
ID : 'var'
ASSIGNMENT : '='
INT : 5
This is where it goes wrong. It should take 9var as an ID, and according to the ID regex, that is not a valid name for an ID. These are my regular expressions:
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    return t

def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ASSIGNMENT = r'\='
How can I solve this?
Any help would be greatly appreciated!
Recommended answer
You say: "It should take 9var as an ID". But then you point out that 9var doesn't match the ID regex pattern. So why should 9var be scanned as an ID?
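To make the behaviour in the question concrete, here is a minimal sketch that uses Python's standard `re` module rather than PLY itself. It tries the three patterns from the question at each position, in the order PLY documents for rule matching (function rules in definition order, then string rules). The `RULES` table and the `scan` helper are illustrative names, not part of PLY.

```python
import re

# Patterns copied from the question, in the order they would be tried:
# function rules (ID, INT) first, then the string rule (ASSIGNMENT).
RULES = [
    ('ID', re.compile(r'[a-zA-Z_][a-zA-Z_0-9]*')),
    ('INT', re.compile(r'\d+')),
    ('ASSIGNMENT', re.compile(r'=')),
]

def scan(text):
    """Return (type, value) pairs; a stand-in for a lexer loop, not PLY code."""
    result = []
    pos = 0
    while pos < len(text):
        if text[pos] in ' \t':          # skip blanks, as t_ignore would in PLY
            pos += 1
            continue
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                value = int(m.group()) if name == 'INT' else m.group()
                result.append((name, value))
                pos = m.end()
                break
        else:
            raise ValueError(f'illegal character {text[pos]!r} at {pos}')
    return result

print(scan('var = 5'))    # [('ID', 'var'), ('ASSIGNMENT', '='), ('INT', 5)]
print(scan('9var = 5'))   # [('INT', 9), ('ID', 'var'), ('ASSIGNMENT', '='), ('INT', 5)]
```

At the start of 9var the ID pattern cannot match because the first character is a digit, so INT consumes 9, and the ID pattern then matches var from the next position.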
If you want 9var to be an ID, it would be easy enough to change the regex from [a-zA-Z_][a-zA-Z_0-9]* to [a-zA-Z_0-9]+. (That will also match pure integers, so you'd need to ensure that the INT pattern is applied first. Alternatively, you could use [a-zA-Z_0-9]*[a-zA-Z_][a-zA-Z_0-9]*.)
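The difference between the two alternative patterns can be checked directly with the standard `re` module. This is only a quick sketch; `LOOSE_ID` and `STRICT_ID` are names chosen here for illustration.

```python
import re

# Accepts any run of letters, digits, and underscores, including pure integers.
LOOSE_ID = re.compile(r'[a-zA-Z_0-9]+')
# Same character set, but requires at least one letter or underscore somewhere.
STRICT_ID = re.compile(r'[a-zA-Z_0-9]*[a-zA-Z_][a-zA-Z_0-9]*')

for word in ['var', '9var', '123']:
    loose = bool(LOOSE_ID.fullmatch(word))
    strict = bool(STRICT_ID.fullmatch(word))
    print(f'{word!r}: loose={loose} strict={strict}')

# 'var':  loose=True strict=True
# '9var': loose=True strict=True
# '123':  loose=True strict=False
```

Because the second pattern rejects 123, it does not depend on the INT rule being tried before the ID rule.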
I suspect that what you really want is for 9var to be recognized as a lexical error rather than a parsing error. But if it is going to be recognized as an error in any case, does it really matter whether it is a lexical error or a syntax error?
It's worth mentioning that the Python lexer works exactly the way your lexer does: it will scan 9var as two tokens, and that will later create a syntax error.
Of course, it is possible that in your language there is some syntactically correct construction in which an ID can directly follow an INT. Or, if not, one in which a keyword can directly follow an INT, such as the Python expression 3 if x else 2. (Again, Python accepts that written as 3if x else 2, although Python 3.11 and later emit a warning for it.)
So if you really insist on flagging a scanner error for tokens which start with a digit and continue with non-digits, you can insert another pattern, such as [0-9]+[a-zA-Z_][a-zA-Z_0-9]*, and have its action raise an error.
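A rough sketch of such a rule follows, written in PLY's function-rule style (the regex goes in the docstring). The rule name `t_BAD_ID` and the exception `BadIdentifierError` are hypothetical choices for this example. So that the snippet runs without PLY installed, the rule is exercised by hand with `re` and a stand-in token object; the `check` helper is likewise only for illustration.

```python
import re
from types import SimpleNamespace

class BadIdentifierError(Exception):
    """Hypothetical exception type for this sketch; not part of PLY."""

def t_BAD_ID(t):
    r'[0-9]+[a-zA-Z_][a-zA-Z_0-9]*'
    # The action runs only when the pattern above has matched.
    raise BadIdentifierError(f'invalid identifier {t.value!r} on line {t.lineno}')

# Stand-in for the lexer: match the rule's docstring regex by hand and call
# the action with a minimal object carrying the attributes the action reads.
BAD_ID_PATTERN = re.compile(t_BAD_ID.__doc__)

def check(text):
    m = BAD_ID_PATTERN.match(text)
    if m:
        t_BAD_ID(SimpleNamespace(value=m.group(), lineno=1))
    return 'ok'

print(check('var = 5'))   # ok
print(check('5'))         # ok
try:
    check('9var = 5')
except BadIdentifierError as exc:
    print('error:', exc)  # error: invalid identifier '9var' on line 1
```

In an actual PLY lexer module, this function would need to be defined above t_INT, because PLY tries function rules in the order they are defined and t_INT would otherwise consume the leading digits first.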