ANTLR4: Using non-ASCII characters in token rules

Question

Page 74 of the ANTLR 4 book says that any Unicode character can be used in a grammar simply by specifying its code point in this manner:

'\uxxxx'

where xxxx is the hexadecimal value for the Unicode codepoint.

So I used that technique in a token rule for an ID token:

grammar ID;

id : ID EOF ;

ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;

When I tried to parse this input:

Gŭnter

ANTLR reports an error saying that it does not recognize ŭ. (The ŭ character is U+016D, so it is within the specified range.)
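As a quick sanity check of that claim, the range test can be sketched in plain Java without ANTLR. The class name `CodePointCheck` and the helper `matchesIdRule` are hypothetical names introduced here for illustration; the helper only mirrors the character ranges of the ID rule and is not how ANTLR itself matches tokens.

```java
class CodePointCheck {
    // Hypothetical helper: true if the string is non-empty and every char
    // falls in a-z, A-Z, or U+0100..U+017E, the same ranges as the ID rule.
    static boolean matchesIdRule(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '\u0100' && c <= '\u017E');
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The input word, written with an escape so the result does not
        // depend on the encoding of this source file.
        String name = "G\u016Dnter";
        System.out.printf("U+%04X%n", (int) name.charAt(1)); // prints U+016D
        System.out.println(matchesIdRule(name));             // prints true
    }
}
```

So the character itself is covered by the rule, as long as it reaches the lexer as the single character U+016D.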

What am I doing wrong?

Recommended answer

ANTLR is ready to accept 16-bit characters, but by default many locales read input as 8-bit bytes. You need to specify the appropriate encoding when you read the file using the Java libraries. If you are using the TestRig, perhaps through the grun alias/script, pass the argument -encoding utf-8 (or whichever encoding the file actually uses). If you look at the source code of that class, you will see the following mechanism:

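// Open the file as raw bytes, then decode them with an explicit charset.
InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
// Note: in ANTLR 4.7 and later, ANTLRInputStream is deprecated in favor of
// CharStreams, e.g. CharStreams.fromFileName(inputFile, charset).
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input); // XLexer stands for your generated lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
...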
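To see why the encoding argument matters, here is a minimal sketch that uses only the Java standard library (no ANTLR). It decodes the UTF-8 bytes of the sample input twice through InputStreamReader, once with the right charset and once with ISO-8859-1, which stands in here for an 8-bit default locale. The class name `EncodingDemo` and the helpers `decode` and `codePoints` are hypothetical, chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: decode bytes through an InputStreamReader with an
    // explicit charset, the same mechanism as in the TestRig excerpt.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: render each char as U+XXXX for inspection.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulate the bytes of a UTF-8 encoded input file.
        byte[] fileBytes = "G\u016Dnter".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the second char is U+016D, inside the ID range.
        System.out.println(codePoints(decode(fileBytes, StandardCharsets.UTF_8)));
        // prints U+0047 U+016D U+006E U+0074 U+0065 U+0072

        // Wrong 8-bit charset: the two bytes C5 AD become two separate chars,
        // U+00C5 and U+00AD, neither of which the ID rule accepts.
        System.out.println(codePoints(decode(fileBytes, StandardCharsets.ISO_8859_1)));
        // prints U+0047 U+00C5 U+00AD U+006E U+0074 U+0065 U+0072
    }
}
```

With the wrong charset the lexer never sees ŭ at all. It sees two characters below U+0100, which would explain the "unrecognized character" error even though the grammar range is correct.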
