How to use similar lexers


Question

I have the following grammar:

cmds
    : cmd+
    ;

cmd
    : include_cmd  |  other_cmd
    ;

include_cmd
    : INCLUDE  DOUBLE_QUOTE  FILE_NAME  DOUBLE_QUOTE
    ;

other_cmd
    : CMD_NAME  ARG+
    ;


INCLUDE
    : '#include'
    ;

DOUBLE_QUOTE
    : '"'
    ;

CMD_NAME
    : ('a'..'z')*
    ;

ARG
    : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
    ;

FILE_NAME
    : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
    ;

So the differences between CMD_NAME, ARG and FILE_NAME are small: CMD_NAME must consist of lowercase letters only, ARG can also contain uppercase letters, digits and "_", and FILE_NAME can additionally contain ".".

But there is a problem: when I test the rule with #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME. I think this is because CMD_NAME comes before FILE_NAME in the grammar file, and it leads to a parsing error.

Do I have to rely on a technique such as predicates to deal with this? Is there a pure EBNF solution that does not rely on the host programming language?

Thanks.

Recommended answer

But there is a problem: when I test the rule with #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME. I think this is because CMD_NAME comes before FILE_NAME in the grammar file, and it leads to a parsing error.

The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs, and the input abc belongs to both. Since both rules match the same text, the lexer assigns it to the rule listed first (as you suspected), which is CMD_NAME.
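The tie-breaking described above can be reproduced with a toy lexer. This is only a sketch in Python, not ANTLR itself: it assumes the common policy "longest match wins, and on a tie the rule listed first wins", it skips whitespace (the grammar in the question has no whitespace rule), and it uses `+` for CMD_NAME so that the loop cannot match empty input. The name `toy_lex` is made up for illustration.

```python
import re

# Rules in the same order as the grammar in the question.
RULES = [
    ("INCLUDE", re.compile(r"#include")),
    ("DOUBLE_QUOTE", re.compile(r'"')),
    ("CMD_NAME", re.compile(r"[a-z]+")),  # assumption: '+' instead of '*'
    ("ARG", re.compile(r"[a-zA-Z0-9_]+")),
    ("FILE_NAME", re.compile(r"[a-zA-Z0-9_.]+")),
]

def toy_lex(text):
    """Return (rule_name, text) pairs using longest-match, first-rule-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # assumption: whitespace is skipped
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(toy_lex('#include "abc"'))
```

In this model, abc comes out as CMD_NAME because all three identifier-like rules match the same three characters and CMD_NAME is listed first. An input such as abc.h would become FILE_NAME only because that rule matches more characters.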

Do I have to rely on a technique such as [predicates] to deal with this? Is there a pure EBNF solution that does not rely on the host programming language?

It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:

include_cmd : INCLUDE STRING;

STRING 
    : '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
    ;

Now the input #include "abc" turns into the tokens [INCLUDE : #include] [STRING : abc].
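The effect of the STRING rule, including the action that strips the surrounding quotes, can be sketched outside ANTLR. The Python below is an illustration under assumptions: the regular expressions mirror the INCLUDE and STRING rules, whitespace skipping is added because the original grammar shows no whitespace rule, and `lex_include` is a hypothetical helper name.

```python
import re

# One alternative per token type; STRING mirrors '"' ~('"'|'\r'|'\n')* '"'.
TOKEN_RE = re.compile(
    r'(?P<INCLUDE>#include)|(?P<STRING>"[^"\r\n]*")|(?P<WS>[ \t]+)'
)

def lex_include(text):
    """Tokenize an include line into (type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "STRING":
            value = value[1:-1]  # drop the quotes, like the setText(...) action
        if kind != "WS":         # assumption: whitespace is discarded
            tokens.append((kind, value))
        pos = m.end()
    return tokens

print(lex_include('#include "abc"'))
```

Because the whole quoted text is one token, its content never competes with CMD_NAME or ARG, so the ordering problem does not arise for file names.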

I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop the FILE_NAME rule in favor of something like the rules above.

It is also worth noting that your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
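The zero-length match is easy to see with ordinary regular expressions, used here as a stand-in for the lexer rule. This is an analogy in Python rather than ANTLR's own matching code:

```python
import re

star_rule = re.compile(r"[a-z]*")  # like ('a'..'z')*
plus_rule = re.compile(r"[a-z]+")  # like ('a'..'z')+

# With '*', the rule succeeds even where no lowercase letter is present,
# producing an empty token that consumes no input.
empty_match = star_rule.match("123")
print(repr(empty_match.group()), empty_match.end())

# With '+', the rule simply fails at that position.
print(plus_rule.match("123"))
```

A token that consumes nothing leaves the lexer at the same position, which is why an empty-matching rule is usually a mistake.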

Keep in mind, too, that you'll have the same problem with ARG that you had with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will be matched as CMD_NAME. Consider breaking these rules up into more conventional ones, like so:

other_cmd : ID (ID | NUMBER)+ SEMI;   //instead of CMD_NAME ARG+
ID        : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER    : ('0'..'9')+;              //"number" part of ARG
SEMI      : ';';

I added the SEMI rule to mark the end of a command. Otherwise the parser won't know whether the input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).
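To see how the terminator removes the ambiguity, here is a small hand-written sketch of the ID / NUMBER / SEMI rules and the other_cmd rule. It is written in Python for illustration only; the function names `lex` and `parse_cmds` and the whitespace handling are assumptions, not something generated by ANTLR.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<ID>[a-zA-Z_]+)|(?P<NUMBER>[0-9]+)|(?P<SEMI>;)|(?P<WS>\s+)"
)

def lex(text):
    """Split text into (type, text) pairs, discarding whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_cmds(text):
    """Parse one or more commands of the form: ID (ID | NUMBER)+ SEMI."""
    tokens = lex(text)
    cmds, i = [], 0
    while i < len(tokens):
        kind, name = tokens[i]
        if kind != "ID":
            raise ValueError(f"expected a command name, got {name!r}")
        i += 1
        args = []
        while i < len(tokens) and tokens[i][0] in ("ID", "NUMBER"):
            args.append(tokens[i][1])
            i += 1
        if not args:
            raise ValueError(f"command {name!r} needs at least one argument")
        if i == len(tokens) or tokens[i][0] != "SEMI":
            raise ValueError(f"missing ';' after command {name!r}")
        i += 1  # consume the ';'
        cmds.append((name, args))
    if not cmds:
        raise ValueError("expected at least one command")
    return cmds

print(parse_cmds("a b c d;"))
print(parse_cmds("a b; c d;"))
```

The first call yields a single command a with arguments b, c and d. The second yields two commands, a(b) and c(d). The position of each `;` is the only thing that distinguishes the two readings.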
