ANTLR4: Using non-ASCII characters in token rules

Problem description

Page 74 of the ANTLR4 book says that any Unicode character can be used in a grammar simply by specifying its code point in this manner:

'\uxxxx'

where xxxx is the hexadecimal value of the Unicode code point.

So I used that technique in a token rule for an ID token:

grammar ID;

id : ID EOF ;

ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;

When I tried to parse this input:

Gŭnter

ANTLR reports an error, saying that it does not recognize ŭ. (The ŭ character is U+016D, hex 016D, so it is within the specified range.)
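The claim that ŭ falls inside the range can be checked independently of ANTLR. The sketch below is plain Java, not ANTLR code: the class name CodePointCheck and its helper methods are hypothetical, and the range test simply mirrors the three ranges written in the ID rule. It suggests that the grammar itself is fine when the input really contains U+016D.

```java
public class CodePointCheck {
    // Mirrors the ranges of the ID rule: 'a'..'z', 'A'..'Z', '\u0100'..'\u017E'.
    static boolean isIdChar(int cp) {
        return (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z')
            || (cp >= 0x0100 && cp <= 0x017E);
    }

    // True if the string is non-empty and every code point is allowed by the rule.
    static boolean matchesId(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(CodePointCheck::isIdChar);
    }

    public static void main(String[] args) {
        // The test input, with the second letter written as an escape so that
        // the encoding of this source file does not matter.
        String input = "G\u016Dnter";
        System.out.printf("U+%04X%n", input.codePointAt(1)); // prints U+016D
        System.out.println(matchesId(input));                // prints true
    }
}
```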

What am I doing wrong?

Recommended answer

ANTLR is ready to accept 16-bit characters but, by default, many locales read input as bytes (8 bits). You need to specify the appropriate encoding when you read the file using the Java libraries. If you are using the TestRig, perhaps through the alias/script grun, pass the argument -encoding utf-8 (or whichever encoding your file uses). If you look at the source code of that class, you will see the following mechanism:

InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input); // XLexer stands for your generated lexer class
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
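To see why the encoding matters, the following sketch decodes the same bytes twice through an InputStreamReader, once as UTF-8 and once as ISO-8859-1. It uses only the Java standard library; the class name EncodingDemo and the choice of ISO-8859-1 as the "wrong" encoding are assumptions made for illustration, since the question does not say which default encoding was in effect.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Decodes raw bytes with an explicit charset, wrapping the stream in a
    // Reader the same way the excerpt above does.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The test input as it would be stored in a UTF-8 file.
        byte[] utf8 = "G\u016Dnter".getBytes(StandardCharsets.UTF_8);

        String good = decode(utf8, StandardCharsets.UTF_8);
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(good.length()); // prints 6
        System.out.println(bad.length());  // prints 7
        // The two UTF-8 bytes of U+016D become two separate characters,
        // neither of which is in the ranges of the ID rule.
        System.out.printf("U+%04X U+%04X%n",
                (int) bad.charAt(1), (int) bad.charAt(2)); // prints U+00C5 U+00AD
    }
}
```

With the wrong charset the lexer never sees U+016D at all, which would explain the error even though the grammar range is correct. In ANTLR 4.7 and later, ANTLRInputStream is deprecated, and CharStreams.fromFileName(fileName, StandardCharsets.UTF_8) is, to my knowledge, the usual replacement for the Reader-based construction.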
