ANTLR4: single quote rule fails when there are escape chars plus carriage return, new line

Question

I have a grammar like this:

grammar Testquote;
program : (Line ';')+ ;
Line: L_S_STRING ;
L_S_STRING  : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\''; // Single quoted string literal
L_WS        : L_BLANK+ -> skip ;   // Whitespace
fragment L_BLANK : (' ' | '\t' | '\r' | '\n') ;

This grammar, and L_S_STRING in particular, seems to work fine with vanilla inputs like:

'ab';
'cd';

However, this input fails:

'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';
'cd';

Yet it works when I change the first line to either 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z'''; or 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\' ';

I can sort of see why the parser might choose this failing route. But is there some way I can tell it to choose differently?

Answer

According to the ANTLR4 docs, both lexer and parser rules are greedy, so they match as much input as they can. In your case:

'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';
                               ^^^
'cd';

Your grammar is somewhat ambiguous: the characters I've highlighted can be interpreted either as \' ' or as \ ''. Here is how that plays out.

Without 'cd', the lexer matches a single string, because it is a valid string under your grammar, and the highlighted characters are matched as \' '. But since the lexer is greedy, it will use this ambiguity to match extra, unwanted input at the first opportunity, for example when another unescaped ' appears somewhere later.
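
To make the longest-match effect concrete, here is a small Python sketch (a simulation, not ANTLR itself) that enumerates every position where the original L_S_STRING rule could end on the failing input and then takes the furthest one. It assumes the lexer keeps the longest text a rule can match; the helper names body_steps and possible_ends are made up for this illustration.

```python
# Hypothetical simulation of the original L_S_STRING rule; not ANTLR code.
# Assumption: the lexer keeps the longest text the rule can match.

def body_steps(text, i):
    # Body alternatives of the rule: ('\'' '\'') | ('\\' '\'') | ~('\'')
    if text.startswith("''", i):            # doubled quote
        yield i + 2
    if text.startswith("\\'", i):           # backslash followed by a quote
        yield i + 2
    if i < len(text) and text[i] != "'":    # any single non-quote character
        yield i + 1

def possible_ends(text, start=0):
    # Collect every end offset (exclusive) of a string token beginning at start.
    if not text.startswith("'", start):
        return set()
    todo, seen, ends = [start + 1], set(), set()
    while todo:
        i = todo.pop()
        if i in seen:
            continue
        seen.add(i)
        if text.startswith("'", i):         # a quote here may close the string
            ends.add(i + 1)
        todo.extend(body_steps(text, i))
    return ends

# The failing input from the question: two statements on two lines.
source = r"'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';" + "\n'cd';"
ends = possible_ends(source)
longest = source[:max(ends)]
print(sorted(ends))
print(repr(longest))
```

On this input the candidate ends include the intended one, just before the first ;, but the furthest candidate lies on the second line, so the chosen token swallows the ;, the newline, and the opening quote of 'cd'.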

This ambiguity arises because a backslash can be either a normal character or an escape character. The common way to remove it is to add a rule for escaping the backslash itself (\\) and to stop matching a backslash as a normal character.

Alternatively, you may want to deal with the ambiguity in a different way. If you want to prioritize \' over '', you should write:

L_S_STRING  : '\'' ( ('\'\'') | ('\\'+ ~'\\') | ~('\'' | '\\') )* '\'' ;

It works for your input.
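
As a rough check of that claim, the following Python sketch simulates the revised rule (it does not run ANTLR) and lists every position where a string token could end on the input from the question. It assumes the lexer keeps the longest possible match; revised_body_steps and revised_possible_ends are hypothetical helper names.

```python
# Hypothetical simulation of the revised L_S_STRING rule; not ANTLR code.
# Assumption: the lexer keeps the longest text the rule can match.

def revised_body_steps(text, i):
    # Body alternatives: ('\'\'') | ('\\'+ ~'\\') | ~('\'' | '\\')
    if text.startswith("''", i):                # doubled quote
        yield i + 2
    j = i
    while j < len(text) and text[j] == "\\":    # run of one or more backslashes
        j += 1
    if i < j < len(text):                       # ...plus the character after it
        yield j + 1
    if i < len(text) and text[i] not in "'\\":  # ordinary character
        yield i + 1

def revised_possible_ends(text, start=0):
    # Collect every end offset (exclusive) of a string token beginning at start.
    if not text.startswith("'", start):
        return set()
    todo, seen, ends = [start + 1], set(), set()
    while todo:
        i = todo.pop()
        if i in seen:
            continue
        seen.add(i)
        if text.startswith("'", i):             # a quote here may close the string
            ends.add(i + 1)
        todo.extend(revised_body_steps(text, i))
    return ends

# The failing input from the question: two statements on two lines.
source = r"'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';" + "\n'cd';"
first_ends = revised_possible_ends(source, 0)
second_start = source.index("'cd'")
second_ends = revised_possible_ends(source, second_start)
print(repr(source[:max(first_ends)]))
print(repr(source[second_start:max(second_ends)]))
```

Because a run of backslashes now always consumes the character that follows it, each of the two strings has only one possible end, and the first token stops right before the first ;.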

By the way, you can shorten your code for L_WS:

L_WS : [ \t\n\r]+ -> skip ;
