Antlr4 doesn't correctly recognize Unicode characters
Question
I have a very simple grammar that tries to match 'é' to the token E_CODE.
I tested it with the TestRig tool (using the -tokens option), but the character is not matched correctly.
My input file is encoded in UTF-8 without a BOM, and I am using ANTLR 4.4.
Could somebody else check this as well? I get this output on my console:
line 1:0 token recognition error at: 'Ă'
grammar Unicode;
stat:EOF;
E_CODE: '\u00E9' | 'é';
Recommended answer
I tested the grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '\u00E9' | 'é';
as follows:
UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
and the following got printed to my console:
éé<EOF>
Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
Looking at the source, I see that TestRig takes an optional -encoding
parameter. Have you tried setting it?
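
To illustrate why the encoding matters, here is a minimal sketch that uses only the JDK (no ANTLR classes). It encodes 'é' as UTF-8 and decodes the resulting bytes with different charsets, printing code points rather than glyphs so the console encoding does not interfere. The class and method names are made up for this example. The use of windows-1250 is an assumption: it is one charset in which the first UTF-8 byte of 'é' (0xC3) decodes to 'Ă', the character shown in the error message, but the asker's actual platform default is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ANTLR.
class EncodingDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with the given charset. */
    public static String reDecode(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    /** Renders each UTF-16 char of the string as U+XXXX, separated by spaces. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as an escape so the encoding of this source file does not matter.
        String e = "\u00E9";

        // Decoded with the right charset: a single char, U+00E9, which E_CODE matches.
        System.out.println(describe(reDecode(e, StandardCharsets.UTF_8)));

        // Decoded as ISO-8859-1: two chars, U+00C3 U+00A9, neither of which is 'é'.
        System.out.println(describe(reDecode(e, StandardCharsets.ISO_8859_1)));

        // Assumption: windows-1250 stands in for the asker's platform default.
        // It is an optional charset, so check that this JVM provides it.
        if (Charset.isSupported("windows-1250")) {
            // The first char is U+0102, i.e. 'Ă', as in the reported error.
            System.out.println(describe(reDecode(e, Charset.forName("windows-1250"))));
        }
    }
}
```

The test above that passes a Java string literal to ANTLRInputStream never goes through this byte-decoding step, which would explain why it works while the file-based TestRig run does not. If the cause is indeed a mismatched default charset, passing -encoding UTF-8 to TestRig should make the lexer see a single U+00E9.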