Writing parser rules sensitive to whitespace while skipping WS from the lexer

Question

I am having some trouble handling whitespace. In the following grammar excerpt, I set up the lexer so that whitespace is skipped and never reaches the parser:

ENTITY_VAR
    : 'user'
    | 'resource'
    ;

INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;

NEWLINE : '\r'? '\n';

WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines

fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );

The problem is that I would like to match variable names of the form ENTITY_ID such that the matched string contains no whitespace. Writing it as a lexer rule, as I did here, is sufficient for that, but I would rather do it with a parser rule, because I want direct access to the two tokens ENTITY_VAR and ID individually from my code instead of having them squeezed together into a single ENTITY_ID token.

Any ideas, please? Basically, any solution that lets me access ENTITY_VAR and ID directly would suit me, whether it keeps ENTITY_ID as a lexer rule or moves it to the parser.

Recommended answer

There are several approaches I can think of (in no particular order):

  1. Emit several tokens from the rule ENTITY_ID. See "ANTLR4: How to inject tokens" for inspiration.
  2. Allow whitespace in the parser and check for it afterwards.
  3. Use the single token and split it in code.
  4. Use the single token and modify the token stream before passing it to the parser. That is: lex, split each ENTITY_ID token into several other tokens, then pass the modified stream to the parser.
  5. Don't skip whitespace, and when dealing with these "extra tokens" check whether they are within an ENTITY_ID part (=> error) or not (=> ignore).
  6. Don't skip whitespace, and add "WS*" everywhere in your grammar where whitespace is allowed (fine if the grammar is not too large).
  7. Insert predicates in the parser rule that check whether there is whitespace between the tokens.
  8. Create a "trap" rule like this:

INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
                  | '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
                  | '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
                  ;

This will catch invalid ENTITY_IDs: because the lexer prefers the longest match, the trap rule wins over the shorter individual tokens that its parts would otherwise become.

I'd go with 2, provided it doesn't alter the parse in the "non-error" case, i.e. no code is interpreted differently because whitespace is allowed.
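
For comparison, option 3 (keep the single ENTITY_ID token and split it in code) might look roughly like the sketch below. It is written in Python purely for illustration and works on the token's text only, so it does not depend on the ANTLR runtime. The names `split_entity_id` and `EntityId` are hypothetical, and the regular expression is an assumption that mirrors the ENTITY_VAR and ID rules from the question by hand, so it would need to be kept in sync with the grammar.

```python
import re
from typing import NamedTuple, Optional

# Hand-written mirror of the lexer rules from the question (an assumption,
# not generated from the grammar):
#   ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)? ;
#   ID        : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID_RE = re.compile(
    r"__(?P<entity_var>user|resource)"
    r"(?:_w_(?P<id>[A-Za-z][A-Za-z0-9_#]*))?"
)


class EntityId(NamedTuple):
    """The two parts of an ENTITY_ID token (hypothetical helper type)."""
    entity_var: str
    id_text: Optional[str]  # None when the optional '_w_' ID part is absent


def split_entity_id(text: str) -> EntityId:
    """Split the text of an ENTITY_ID token into its ENTITY_VAR and ID parts."""
    match = ENTITY_ID_RE.fullmatch(text)
    if match is None:
        # Text containing whitespace, or any other shape, ends up here.
        raise ValueError(f"not a valid ENTITY_ID: {text!r}")
    return EntityId(match.group("entity_var"), match.group("id"))
```

In a listener or visitor, the text of the ENTITY_ID token would be passed to `split_entity_id`. Since the lexer rule has already guaranteed that the token contains no whitespace, the `ValueError` branch mainly guards against the grammar and the regular expression drifting apart.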
