Antlr4 doesn't correctly recognize Unicode characters
Problem description
I have a very simple grammar that tries to match 'é' to the token E_CODE.
I tested it with the TestRig tool (using the -tokens option), but the parser can't match it correctly.
My input file is encoded in UTF-8 without a BOM, and I'm using ANTLR version 4.4.
Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'
grammar Unicode;
stat:EOF;
E_CODE: '\u00E9' | 'é';
Recommended answer
I tested the grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '\u00E9' | 'é';
as follows:
UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
and the following got printed to my console:
éé<EOF>
Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
Looking at the source, I see that TestRig takes an optional -encoding parameter. Have you tried setting it?
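To illustrate why the encoding setting matters, here is a minimal sketch using only the Java standard library (no ANTLR). It assumes, purely for illustration, that the platform default encoding on the asker's machine was a Central European code page such as windows-1250; the class and method names are hypothetical. When the UTF-8 bytes of 'é' (0xC3 0xA9) are decoded with that charset, they become two characters, and the first one is 'Ă', which matches the character in the error message above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encode text as UTF-8, then decode the bytes with the given charset.
    // This mimics reading a UTF-8 file with a different (wrong) encoding.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // 'é'

        // Correct round trip: one character comes back.
        String ok = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);

        // Assumed wrong default: windows-1250 (an assumption, not confirmed by the question).
        String garbled = decodeUtf8BytesAs(original, Charset.forName("windows-1250"));

        // Print code points rather than glyphs so the console encoding cannot distort the result.
        System.out.println(ok.length());                            // 1
        System.out.println(garbled.length());                       // 2
        System.out.printf("U+%04X%n", (int) garbled.charAt(0));     // U+0102, i.e. 'Ă'
    }
}
```

If this is what happened, the lexer never sees U+00E9 at all, so E_CODE cannot match. Passing the encoding explicitly to TestRig, for example with -encoding UTF-8 before the input file name, would be the first thing to try.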