How to make antlr4 fully tokenize terminal nodes

Published: 2020/9/3 0:13:12 · Tag: antlr4

Question

I'm trying to use Antlr to make a very simple parser, that basically tokenizes a series of .-delimited identifiers.

I've made a simple grammar:

r  : STRUCTURE_SELECTOR ;
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
ID : [_a-z0-9$]* ;             
WS : [ \t\r\n]+ -> skip ;

When the parser is generated, I end up with a single terminal node that represents the string instead of being able to find further STRUCTURE_SELECTORs. I'd like instead to see a sequence (perhaps represented as children of the current node). How can I accomplish this?

For example:

. would yield one terminal node whose text is .
.foobar would yield two nodes, a parent with text . and a child with text foobar
.foobar.baz would yield four nodes, a parent with text ., a child with text foobar, a second-level child with text ., and a third-level child with text baz.

Recommended answer

Rules starting with a capital letter are Lexer rules.

With the following input file t.text

.
.foobar
.foobar.baz

your grammar (in file Question.g4) produces the following tokens

$ grun Question r -tokens -diagnostics t.text
[@0,0:0='.',<STRUCTURE_SELECTOR>,1:0]
[@1,2:8='.foobar',<STRUCTURE_SELECTOR>,2:0]
[@2,10:20='.foobar.baz',<STRUCTURE_SELECTOR>,3:0]
[@3,22:21='<EOF>',<EOF>,4:0]

The lexer is greedy: for each token it tries to match as many input characters as it can with a single rule. The lexer rule STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? can read a dot, an ID, and then, because the rule refers to itself, another dot and another ID, and so on until the newline (which the WS rule skips). That's why each line ends up as a single token.
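The greedy matching described above can be imitated outside ANTLR. The following is only a rough sketch in Python: the regular expressions are hypothetical stand-ins for the two lexer rules (the recursive rule is unrolled into a loop), and `longest_match` is a helper name invented for this illustration, not part of any ANTLR API.

```python
import re

# Hypothetical regex stand-ins for the original lexer rules (not ANTLR itself).
ID = r"[_a-z0-9$]*"                    # original ID rule: zero or more characters
STRUCTURE_SELECTOR = rf"(?:\.{ID})+"   # '.' (ID STRUCTURE_SELECTOR?)? unrolled into a loop


def longest_match(pattern, text, pos=0):
    """Return the greedy match of pattern at text[pos:], or None if there is none."""
    m = re.compile(pattern).match(text, pos)
    return m.group(0) if m else None


if __name__ == "__main__":
    for line in [".", ".foobar", ".foobar.baz"]:
        # Each whole line is swallowed by the single selector pattern.
        print(repr(line), "->", repr(longest_match(STRUCTURE_SELECTOR, line)))
```

Because one pattern can cover the whole line, nothing is left over to become separate '.' and ID tokens, which mirrors the single STRUCTURE_SELECTOR token per line in the output above.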

Compiling the grammar also produces this warning:

warning(146): Question.g4:5:0: non-fragment lexer rule ID can match the empty string

It appears because the repetition marker of ID is * (zero or more times) instead of + (one or more times).
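The difference between the two repetition markers is the same as in ordinary regular expressions. As a small illustration (plain Python `re`, used here only as an analogy for the ANTLR character set; the variable names are made up for this sketch):

```python
import re

ID_STAR = re.compile(r"[_a-z0-9$]*")  # zero or more: can match the empty string
ID_PLUS = re.compile(r"[_a-z0-9$]+")  # one or more: needs at least one character

matches_empty_star = ID_STAR.fullmatch("") is not None
matches_empty_plus = ID_PLUS.fullmatch("") is not None

if __name__ == "__main__":
    print("* matches empty string:", matches_empty_star)
    print("+ matches empty string:", matches_empty_plus)
```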

Now try the following grammar:

grammar Question;

r  
@init {System.out.println("Question last update 2135");}
    :   ( structure_selector NL )+ EOF
    ;

structure_selector
    :   '.'
    |   '.' ID structure_selector*
    ;

ID  : [_a-z0-9$]+ ;   
NL  : [\r\n]+ ;          
WS  : [ \t]+ -> skip ;

$ grun Question r -tokens -diagnostics t.text
[@0,0:0='.',<'.'>,1:0]
[@1,1:1='\n',<NL>,1:1]
[@2,2:2='.',<'.'>,2:0]
[@3,3:8='foobar',<ID>,2:1]
[@4,9:9='\n',<NL>,2:7]
[@5,10:10='.',<'.'>,3:0]
[@6,11:16='foobar',<ID>,3:1]
[@7,17:17='.',<'.'>,3:7]
[@8,18:20='baz',<ID>,3:8]
[@9,21:21='\n',<NL>,3:11]
[@10,22:21='<EOF>',<EOF>,4:0]
Question last update 2135
line 3:7 reportAttemptingFullContext d=1 (structure_selector), input='.'
line 3:7 reportContextSensitivity d=1 (structure_selector), input='.'

Running $ grun Question r -gui t.text then displays the hierarchical tree structure you are expecting.
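For readers without the GUI at hand, the nesting can be approximated with a small hand-written recursive-descent parser. This is only a sketch that mirrors the structure_selector rule in Python; it is not the code ANTLR generates, and `tokenize`, `parse_selector`, and `parse_line` are names invented for this illustration. It handles one line at a time, so the NL and EOF handling of rule r is left out, and the inner call simply takes as many following selectors as it can.

```python
import re

# Token patterns corresponding to '.', ID and WS in the revised grammar.
TOKEN = re.compile(r"(?P<DOT>\.)|(?P<ID>[_a-z0-9$]+)|(?P<WS>[ \t]+)")


def tokenize(line):
    """Split one line into (type, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {line[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens


def parse_selector(tokens, i=0):
    """structure_selector : '.' | '.' ID structure_selector* ;"""
    if i >= len(tokens) or tokens[i][0] != "DOT":
        raise SyntaxError("expected '.'")
    node = ["structure_selector", "."]
    i += 1
    if i < len(tokens) and tokens[i][0] == "ID":
        node.append(tokens[i][1])
        i += 1
        # Nested selectors become children of the current node.
        while i < len(tokens) and tokens[i][0] == "DOT":
            child, i = parse_selector(tokens, i)
            node.append(child)
    return node, i


def parse_line(line):
    """Parse a single line and require that all tokens are consumed."""
    tokens = tokenize(line)
    tree, i = parse_selector(tokens)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return tree


if __name__ == "__main__":
    for line in [".", ".foobar", ".foobar.baz"]:
        print(line, "->", parse_line(line))
```

Each nested list plays the role of a child structure_selector node, so .foobar.baz comes out as a parent selector containing '.', 'foobar', and a child selector for '.baz'.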
