Antlr4: single quote rule fails when there are escape chars plus carriage return, new line
Problem description
I have a grammar like this:
grammar Testquote;
program : (Line ';')+ ;
Line: L_S_STRING ;
L_S_STRING : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\''; // Single quoted string literal
L_WS : L_BLANK+ -> skip ; // Whitespace
fragment L_BLANK : (' ' | '\t' | '\r' | '\n') ;
This grammar, and L_S_STRING in particular, seems to work fine with vanilla inputs like:
'ab';
'cd';
However, this input fails:
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';
'cd';
Yet it works when I change the first line to either
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z''';
or
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\' '
;
I can sort of see why the parser might choose this failing route. But is there some way I can tell it to choose differently?
Recommended answer
According to the ANTLR4 docs, both lexer and parser rules are greedy, so they match as much input as they can. In your case:
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';
                               ^^^
'cd';
Your grammar is somewhat ambiguous: the characters I've highlighted can be interpreted either as \' followed by ', or as \ followed by ''. Here is how that plays out.
Without 'cd', the lexer matches a single string, because it is a valid string under your grammar; the highlighted characters are matched as \' followed by the closing '. But since the lexer is greedy, it will use the aforementioned ambiguity to match more input at the first opportunity, for example when another unescaped ' appears somewhere later. With 'cd' on the next line, reading the highlighted characters as \ followed by '' keeps the string open, and the opening quote of 'cd' then closes it, which yields a longer token than the one you intended.
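The greedy behaviour described above can be reproduced outside ANTLR. The following is a minimal Python sketch, not ANTLR's actual algorithm: the helper name `longest_string_token` is hypothetical, and it simply explores every way the original L_S_STRING rule could match and keeps the longest result, which is the behaviour the answer attributes to the lexer.

```python
from typing import Optional


def longest_string_token(text: str, start: int = 0) -> Optional[str]:
    """Longest prefix of text[start:] accepted by the ORIGINAL rule:
    '\'' ( '\'\'' | '\\\'' | ~'\'' )* '\''
    (hypothetical helper; a simulation, not ANTLR itself).
    """
    if start >= len(text) or text[start] != "'":
        return None
    frontier = [start + 1]  # positions reachable inside the (...)* loop
    seen = set()
    best_end = None
    while frontier:
        pos = frontier.pop()
        if pos in seen or pos >= len(text):
            continue
        seen.add(pos)
        if text[pos] == "'":
            # A quote may close the string here...
            best_end = max(best_end or 0, pos + 1)
            # ...or, if doubled, continue it as an escaped quote.
            if text.startswith("''", pos):
                frontier.append(pos + 2)
        else:
            # Backslash-quote escape.
            if text.startswith("\\'", pos):
                frontier.append(pos + 2)
            # Any character other than a quote (this includes a bare backslash).
            frontier.append(pos + 1)
    return None if best_end is None else text[start:best_end]


first_line = r"'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';"
print(repr(longest_string_token(first_line)))
print(repr(longest_string_token(first_line + "\n'cd';")))
```

On its own, the first line yields the intended literal. Once the 'cd' line follows, the longest match runs past the semicolon and the newline and ends at the opening quote of 'cd', which is the failure reported in the question.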
This ambiguity arises because a backslash can be either a normal character or an escape character. The common solution for removing it is to add a rule for escaping the backslash itself, \\, and to stop matching a bare backslash as a normal character.
Alternatively, you may want to resolve the ambiguity in a different way. If you want to prioritize \' over '', you could write:
L_S_STRING : '\'' ( ('\'\'') | ('\\'+ ~'\\') | ~('\'' | '\\') )* '\'' ;
It works for your input.
By the way, you can shorten your code for L_WS:
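As a rough check of the revised rule, here is a minimal Python sketch under the same assumption that the lexer keeps the longest possible match. The helper name `longest_fixed_token` is hypothetical, and the code only imitates the three alternatives of the revised L_S_STRING; it is not generated by ANTLR.

```python
from typing import Optional


def longest_fixed_token(text: str, start: int = 0) -> Optional[str]:
    """Longest prefix of text[start:] accepted by the REVISED rule:
    '\'' ( '\'\'' | '\\'+ ~'\\' | ~('\'' | '\\') )* '\''
    (hypothetical helper; a simulation, not ANTLR itself).
    """
    if start >= len(text) or text[start] != "'":
        return None
    frontier = [start + 1]  # positions reachable inside the (...)* loop
    seen = set()
    best_end = None
    while frontier:
        pos = frontier.pop()
        if pos in seen or pos >= len(text):
            continue
        seen.add(pos)
        ch = text[pos]
        if ch == "'":
            # A quote may close the string, or start a doubled quote.
            best_end = max(best_end or 0, pos + 1)
            if text.startswith("''", pos):
                frontier.append(pos + 2)
        elif ch == "\\":
            # One or more backslashes, then exactly one non-backslash
            # character, which may itself be a quote.
            run_end = pos
            while run_end < len(text) and text[run_end] == "\\":
                run_end += 1
            if run_end < len(text):
                frontier.append(run_end + 1)
        else:
            # Any character that is neither a quote nor a backslash.
            frontier.append(pos + 1)
    return None if best_end is None else text[start:best_end]


first_line = r"'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';"
text = first_line + "\n'cd';"
print(repr(longest_fixed_token(text)))
print(repr(longest_fixed_token(text, text.index("'cd'"))))
```

Because a backslash can no longer be matched as an ordinary character, the quote that follows it is always consumed as escaped, so the first literal ends before the semicolon and 'cd' is matched as a separate token. One consequence visible in the sketch is that a literal meant to end in a backslash would not close at the quote that follows it.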
L_WS : [ \t\n\r]+ -> skip ;