How to use similar lexers
Question
I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
So the differences between CMD_NAME, ARG and FILE_NAME are small: CMD_NAME must consist of lower-case letters only, ARG can also contain upper-case letters, digits and "_", and FILE_NAME can additionally contain ".".
But this causes a problem: when I test the rule with #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME. I think this is because CMD_NAME comes before FILE_NAME in the grammar file, and it leads to a parsing error.
Do I have to rely on a technique such as predict to deal with this? Is there a pure EBNF solution that does not rely on the host programming language?
Thanks.
Recommended answer
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs, and input abc qualifies as both. When more than one rule matches the same input, the lexer picks the rule listed first (as you suspected), so abc becomes a CMD_NAME.
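To make the tie-breaking concrete, here is a minimal Python sketch that mimics the behavior described above. It is not the lexer generator itself: the regular expressions are hand translations of the three rules, the function name classify is hypothetical, and the "longest match wins, first listed rule wins on a tie" policy is an assumption about how the generated lexer resolves the overlap.

```python
import re

# Token rules in the same order as in the question's grammar.
# CMD_NAME uses + here so that it cannot match the empty string.
RULES = [
    ("CMD_NAME", re.compile(r"[a-z]+")),
    ("ARG", re.compile(r"[a-zA-Z0-9_]+")),
    ("FILE_NAME", re.compile(r"[a-zA-Z0-9_.]+")),
]


def classify(text):
    """Return the rule chosen for text: longest match, first rule on ties."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = pattern.match(text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name


print(classify("abc"))    # CMD_NAME: all three rules match 3 characters
print(classify("abc.h"))  # FILE_NAME: only this rule can consume the dot
print(classify("Abc_1"))  # ARG: CMD_NAME fails, ARG is listed before FILE_NAME
```

Reordering RULES changes which token abc becomes, which is why rule order alone cannot make abc a CMD_NAME in one place and a FILE_NAME in another.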
Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?
It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" becomes the tokens [INCLUDE : #include] [STRING : abc].
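The effect of the STRING rule can be imitated outside the grammar. The Python sketch below is only an illustration: TOKEN_RE and tokenize are hypothetical names, and skipping spaces and tabs is an assumption, since the grammar shown does not say how whitespace is handled. The quote stripping plays the role of the setText call in the rule's action.

```python
import re

# One alternative per token kind; WS is an assumed whitespace rule.
TOKEN_RE = re.compile(
    r'(?P<INCLUDE>#include)|(?P<STRING>"[^"\r\n]*")|(?P<WS>[ \t]+)'
)


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "STRING":
            value = value[1:-1]  # drop the surrounding quotes
        if kind != "WS":
            tokens.append((kind, value))
        pos = m.end()
    return tokens


print(tokenize('#include "abc"'))
```

Because everything between the quotes is taken as one token, the file name no longer competes with CMD_NAME or ARG.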
I don't think the grammar should be responsible for deciding whether a file name is valid: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME in favor of something like the rules above.
Also worth noting: your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
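The difference between * and + is easy to check with ordinary regular expressions. This short Python sketch uses regex stand-ins for the two forms of the rule; the names STAR, PLUS and matched_text are hypothetical.

```python
import re

STAR = re.compile(r"[a-z]*")  # stand-in for ('a'..'z')*
PLUS = re.compile(r"[a-z]+")  # stand-in for ('a'..'z')+


def matched_text(pattern, text):
    """Return what pattern matches at the start of text, or None on failure."""
    m = pattern.match(text)
    return None if m is None else m.group()


# The * form succeeds on input it cannot consume, with an empty match.
print(repr(matched_text(STAR, "123")))
# The + form fails on the same input.
print(repr(matched_text(PLUS, "123")))
```

A token rule that can succeed without consuming anything gives the lexer a valid match at every position, which is rarely what is intended.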
Keep in mind, too, that you'll have the same problem with ARG that you had with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will be tokenized as CMD_NAME. Consider breaking these rules up into more conventional ones, like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know whether input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).
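A small hand-written parser shows how the terminator removes the ambiguity. This Python sketch follows the rule ID (ID | NUMBER)+ SEMI; the names TOKEN_RE, tokenize and parse_cmds are hypothetical, and skipping whitespace is an assumption that the grammar above does not state.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<ID>[a-zA-Z_]+)|(?P<NUMBER>[0-9]+)|(?P<SEMI>;)|(?P<WS>\s+)"
)


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_cmds(text):
    """Parse one or more commands of the form: ID (ID | NUMBER)+ SEMI."""
    tokens = tokenize(text)
    cmds, i = [], 0
    while i < len(tokens):
        kind, name = tokens[i]
        if kind != "ID":
            raise SyntaxError(f"expected a command name, got {name!r}")
        i += 1
        args = []
        while i < len(tokens) and tokens[i][0] in ("ID", "NUMBER"):
            args.append(tokens[i][1])
            i += 1
        if not args:
            raise SyntaxError(f"command {name!r} needs at least one argument")
        if i >= len(tokens) or tokens[i][0] != "SEMI":
            raise SyntaxError(f"missing ';' after command {name!r}")
        i += 1  # consume the terminator
        cmds.append((name, args))
    return cmds


print(parse_cmds("a b c d;"))   # one command with three arguments
print(parse_cmds("a b; c d;"))  # two commands with one argument each
```

With the terminator, the two readings of a b c d correspond to two different inputs, so the parser never has to guess where a command ends.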