How to make antlr4 fully tokenize terminal nodes
Problem description
I'm trying to use ANTLR to make a very simple parser that basically tokenizes a series of '.'-delimited identifiers.
I've made a simple grammar:
r : STRUCTURE_SELECTOR ;
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
ID : [_a-z0-9$]* ;
WS : [ \t\r\n]+ -> skip ;
When the parser is generated, I end up with a single terminal node that represents the whole string, instead of being able to find further STRUCTURE_SELECTORs. I'd like instead to see a sequence (perhaps represented as children of the current node). How can I accomplish this?
As an example:
. would yield one terminal node whose text is "."
.foobar would yield two nodes: a parent with text "." and a child with text "foobar"
.foobar.baz would yield four nodes: a parent with text ".", a child with text "foobar", a second-level child with text ".", and a third-level child with text "baz"
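Rules whose names start with a capital letter are lexer rules, so STRUCTURE_SELECTOR is matched entirely by the lexer and reaches the parser as a single token.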
With the following input file t.text
.
.foobar
.foobar.baz
your grammar (in file Question.g4) produces the following tokens
$ grun Question r -tokens -diagnostics t.text
[@0,0:0='.',<STRUCTURE_SELECTOR>,1:0]
[@1,2:8='.foobar',<STRUCTURE_SELECTOR>,2:0]
[@2,10:20='.foobar.baz',<STRUCTURE_SELECTOR>,3:0]
[@3,22:21='<EOF>',<EOF>,4:0]
The lexer is greedy (and so is the parser, with tokens instead of characters): it tries to read as many input characters as it can with one rule. The lexer rule STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? can read a dot and an ID and then, through the optional recursive reference STRUCTURE_SELECTOR?, another dot and another ID, and so on until the newline. That's why each line ends up as a single token.
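The greedy behaviour described above can be imitated outside ANTLR. The following is a minimal sketch, not ANTLR's actual implementation: it assumes a hypothetical class GreedyLexerSketch that uses java.util.regex patterns as stand-ins for lexer rules and always picks the longest match at the current position. The rule tables ORIGINAL and SPLIT are illustrative names. ORIGINAL approximates the question's recursive rule as a regex, and SPLIT approximates a lexer where the dot and the ID are separate tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLexerSketch {
    // Regex analogue of the question's lexer rules. With ID : [_a-z0-9$]* the
    // recursive STRUCTURE_SELECTOR rule unrolls to "one or more (dot, optional ID)".
    static final String[][] ORIGINAL = {
        {"STRUCTURE_SELECTOR", "(\\.[_a-z0-9$]*)+"}
    };

    // Analogue of a lexer in which the dot and the ID are separate tokens.
    static final String[][] SPLIT = {
        {"DOT", "\\."},
        {"ID", "[_a-z0-9$]+"}
    };

    // Longest-match tokenizer: at each position every rule is tried and the one
    // consuming the most characters wins (the first rule wins ties).
    static List<String> tokenize(String input, String[][] rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is skipped, like the WS rule
                continue;
            }
            String bestName = null;
            int bestEnd = pos;
            for (String[] rule : rules) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule[0];
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(".foobar.baz", ORIGINAL)); // [STRUCTURE_SELECTOR:.foobar.baz]
        System.out.println(tokenize(".foobar.baz", SPLIT));    // [DOT:., ID:foobar, DOT:., ID:baz]
    }
}
```

With the single recursive rule the whole line collapses into one token. Once the dot and the ID are separate rules, the structure becomes visible to whatever consumes the tokens.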
When compiling the grammar, the warning
warning(146): Question.g4:5:0: non-fragment lexer rule ID can match the empty string
appears because the repetition marker of ID is * (which means zero or more times) instead of + (one or more times).
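The difference between * and + can be seen with an ordinary regular expression. This is only an analogy using java.util.regex, not ANTLR itself, and the class name EmptyMatchSketch and the helper matchLength are made up for the illustration. The sketch reports how many characters the ID pattern consumes at a given offset of the input .foobar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchSketch {
    // Number of characters the pattern consumes starting at offset pos,
    // or -1 if the pattern does not match there at all.
    static int matchLength(String regex, String input, int pos) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m.end() - pos : -1;
    }

    public static void main(String[] args) {
        // Offset 0 of ".foobar" is the dot, which is not an ID character.
        System.out.println(matchLength("[_a-z0-9$]*", ".foobar", 0)); // 0  (matches the empty string)
        System.out.println(matchLength("[_a-z0-9$]+", ".foobar", 0)); // -1 (no match)
        System.out.println(matchLength("[_a-z0-9$]+", ".foobar", 1)); // 6  (matches "foobar")
    }
}
```

A rule that can succeed while consuming nothing would let a tokenizer emit a token without advancing through the input, which is the situation the warning points at.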
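Now try this grammar:
grammar Question;
// Parser rules (lowercase names): one structure_selector per input line.
r
@init {System.out.println("Question last update 2135");}
    : ( structure_selector NL )+ EOF
    ;
// A lone dot, or a dot and an ID followed by any number of nested selectors.
structure_selector
    : '.'
    | '.' ID structure_selector*
    ;
// Lexer rules (uppercase names): ID now requires at least one character.
ID  : [_a-z0-9$]+ ;
NL  : [\r\n]+ ;
WS  : [ \t]+ -> skip ;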
$ grun Question r -tokens -diagnostics t.text
[@0,0:0='.',<'.'>,1:0]
[@1,1:1='\n',<NL>,1:1]
[@2,2:2='.',<'.'>,2:0]
[@3,3:8='foobar',<ID>,2:1]
[@4,9:9='\n',<NL>,2:7]
[@5,10:10='.',<'.'>,3:0]
[@6,11:16='foobar',<ID>,3:1]
[@7,17:17='.',<'.'>,3:7]
[@8,18:20='baz',<ID>,3:8]
[@9,21:21='\n',<NL>,3:11]
[@10,22:21='<EOF>',<EOF>,4:0]
Question last update 2135
line 3:7 reportAttemptingFullContext d=1 (structure_selector), input='.'
line 3:7 reportContextSensitivity d=1 (structure_selector), input='.'
Finally, running $ grun Question r -gui t.text displays the hierarchical tree structure you are expecting.
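To make the nesting concrete without the GUI, here is a hand-written recursive-descent analogue of the structure_selector rule. It is a sketch under stated assumptions, not the parser ANTLR generates: the class SelectorTreeSketch and its methods are hypothetical, the input is assumed to be already split into tokens, and nested selectors are assumed to attach greedily to the innermost open selector. It prints the tree in a LISP-like form.

```java
import java.util.Arrays;
import java.util.List;

class SelectorTreeSketch {
    private final List<String> tokens;
    private int pos = 0;

    SelectorTreeSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // Mirrors: structure_selector : '.' | '.' ID structure_selector* ;
    String structureSelector() {
        if (pos >= tokens.size() || !tokens.get(pos).equals(".")) {
            throw new IllegalArgumentException("expected '.' at token " + pos);
        }
        pos++; // consume the dot
        StringBuilder sb = new StringBuilder("(structure_selector .");
        if (pos < tokens.size() && !tokens.get(pos).equals(".")) {
            sb.append(' ').append(tokens.get(pos++)); // the ID
            // Every following dot starts a nested child selector.
            while (pos < tokens.size() && tokens.get(pos).equals(".")) {
                sb.append(' ').append(structureSelector());
            }
        }
        return sb.append(')').toString();
    }

    // Parses one line's worth of tokens and returns its tree as text.
    static String parse(List<String> tokens) {
        SelectorTreeSketch parser = new SelectorTreeSketch(tokens);
        String tree = parser.structureSelector();
        if (parser.pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token at " + parser.pos);
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList(".")));
        // (structure_selector .)
        System.out.println(parse(Arrays.asList(".", "foobar", ".", "baz")));
        // (structure_selector . foobar (structure_selector . baz))
    }
}
```

The second output shows the four nodes from the question's example: the dot and foobar in the outer selector, and the second dot and baz in the nested one.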