Using Antlr for parsing data from never-ending stream

Problem description

Is ANTLR suitable for parsing data from streams that don't have an EOF right after the text to parse? From what I observe, the lexer does not emit the current token until the first character of the next token is received. On top of that, the parser seems not to emit a rule until the first token of the next rule is received. Here is a simple grammar I tried:

fox: 'quick' 'brown' 'fox' '\r'? '\n' ;

Then I used the generated parser with UnbufferedCharStream and UnbufferedTokenStream:

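  // 'is' is the open-ended input source (an InputStream or Reader)
  CharStream input = new UnbufferedCharStream(is);
  MyLexer lex = new MyLexer(input);
  // copyText = true: each token stores its own text, because an
  // UnbufferedCharStream cannot return arbitrary substrings later
  lex.setTokenFactory(new CommonTokenFactory(true));
  TokenStream tokens = new UnbufferedTokenStream(lex);
  MyParser parser = new MyParser(tokens);
  MyParser.FoxContext fox = parser.fox();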

When the stream gets 'quick', nothing happens.

When 'b' comes in, the parser enters rule 'fox'.

Then 'rown': nothing (2 tokens are in the stream, and neither of them is known to the lexer yet!).

Only after 'f' does the listener visit the first token: 'quick'.

Then nothing on 'ox'.

On the new line (unix), the listener visits token 'brown'.

Now the stream has all data (4 tokens), but only 2 tokens are recognized.

I found that in order to push those tokens through the system, the stream has to deliver 2 more tokens, and any tokens known to the grammar will do. It could be 2 extra new lines, or, say, 'fox' and 'brown'. Only then do the tokens 'fox' and '\n' get visited, the parser exits rule 'fox', and parsing finishes.

Is that a bug or a feature? Is there a way to eliminate that lag?

Thanks!

Solution

The ANTLR 4 book was originally going to contain an example of parsing a streaming input, but I argued against it due to the severe complications that will inevitably arise from the use of an adaptive unlimited lookahead parser for something like this.

ANTLR 4 has no guaranteed lookahead bound (and no way to tell it to look for or even attempt to enforce one), so any implementation that operates on a blocking stream has the possibility of deadlock without returning information about the parse leading up to that point. I wouldn't even entertain the possibility of parsing a streaming input unless I saw an intermediate buffer in place first.

  1. Take all available (or previously unparsed) input and place it in a String or char[].
  2. Create an ANTLRInputStream for the buffer.
  3. Attempt to lex/parse this stream, which will have an implicit EOF on the end.
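
Step 1 could look roughly like the following Java sketch. It is an illustration, not part of the original answer: the class `AvailableInputBuffer` and its methods are hypothetical names, it assumes a single-byte encoding (ISO-8859-1) so that a chunk boundary never splits a character, and it relies on `InputStream.available()`, which is only an estimate of how much can be read without blocking. Steps 2 and 3 would then pass `snapshot()` to ANTLR, for example through `new ANTLRInputStream(text)` as suggested above, or through `CharStreams.fromString(text)` on ANTLR 4.7 and later, where `ANTLRInputStream` is deprecated.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: accumulates input that has arrived but has not been parsed yet.
class AvailableInputBuffer {
    private final StringBuilder unparsed = new StringBuilder();

    /** Copies the bytes that can currently be read without blocking; returns how many were added. */
    int drain(InputStream in) {
        byte[] chunk = new byte[4096];
        int total = 0;
        try {
            int ready;
            while ((ready = in.available()) > 0) {
                int n = in.read(chunk, 0, Math.min(chunk.length, ready));
                if (n <= 0) {
                    break; // end of stream, or nothing was delivered
                }
                // Assumption: single-byte encoding, so one byte is one char.
                unparsed.append(new String(chunk, 0, n, StandardCharsets.ISO_8859_1));
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** The text to hand to a finite ANTLR input stream for one parse attempt. */
    String snapshot() {
        return unparsed.toString();
    }

    /** Drops the first 'chars' characters once they have been parsed successfully. */
    void discard(int chars) {
        unparsed.delete(0, chars);
    }

    public static void main(String[] args) {
        AvailableInputBuffer buffer = new AvailableInputBuffer();
        byte[] bytes = "quickbrown".getBytes(StandardCharsets.ISO_8859_1);
        buffer.drain(new ByteArrayInputStream(bytes));
        System.out.println(buffer.snapshot());
    }
}
```

Because every parse attempt runs over a finite snapshot, the parser never blocks waiting for lookahead; only `drain` touches the live stream.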

The result of the parse will tell you whether to discard the results to that point, or hold on to them to retry when more data is available:

  • If no syntax error occurs, the input was successfully parsed, and you can parse the next section of input when it becomes available later.

  • If a syntax error is reported before the EOF token is consumed, then the syntax error is in the actual input, so you'll want to handle it (report it to the user, etc.).

  • If a syntax error is reported at the point where the EOF token is consumed, then additional input may resolve the problem: ignore the results of the current parse, and retry once more data is available from the input stream.
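
The three outcomes above can be sketched as a small retry loop. This is a minimal illustration under stated assumptions, not the answer's own code: `BufferedRetryDriver`, `Outcome`, and `attemptFox` are hypothetical names, and `attemptFox` is a hand-written stand-in for a real ANTLR parse of the `fox` rule so that the example runs without the ANTLR runtime. It also assumes that each buffer holds at most one `fox` sentence, and that a real syntax error is handled by dropping the bad input. In an actual ANTLR setup, one possible way to tell the last two cases apart is an error listener that checks whether the offending token has type `Token.EOF`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class BufferedRetryDriver {
    /** The three parse outcomes distinguished in the list above. */
    enum Outcome { PARSED, ERROR_BEFORE_EOF, ERROR_AT_EOF }

    private final StringBuilder pending = new StringBuilder(); // input not yet accepted
    private final List<String> accepted = new ArrayList<>();   // sections parsed successfully
    private final Function<String, Outcome> attempt;           // one parse over a finite buffer

    BufferedRetryDriver(Function<String, Outcome> attempt) {
        this.attempt = attempt;
    }

    /** Appends newly arrived text, re-parses the whole pending buffer, and acts on the outcome. */
    Outcome feed(String chunk) {
        pending.append(chunk);
        String text = pending.toString();
        Outcome outcome = attempt.apply(text);
        switch (outcome) {
            case PARSED:
                accepted.add(text);   // keep the result, start fresh for the next section
                pending.setLength(0);
                break;
            case ERROR_BEFORE_EOF:
                // Real syntax error. Assumed policy: drop the bad input; the caller reports it.
                pending.setLength(0);
                break;
            case ERROR_AT_EOF:
                // Input is merely incomplete: keep the buffer and retry when more data arrives.
                break;
        }
        return outcome;
    }

    List<String> accepted() { return accepted; }

    String pending() { return pending.toString(); }

    /** Hypothetical stand-in for parsing the rule: fox: 'quick' 'brown' 'fox' '\r'? '\n' ; */
    static Outcome attemptFox(String text) {
        String unix = "quickbrownfox\n";
        String windows = "quickbrownfox\r\n";
        if (text.equals(unix) || text.equals(windows)) {
            return Outcome.PARSED;
        }
        if (unix.startsWith(text) || windows.startsWith(text)) {
            return Outcome.ERROR_AT_EOF; // ran out of input in the middle of the sentence
        }
        return Outcome.ERROR_BEFORE_EOF; // mismatch before the end of the buffer
    }

    public static void main(String[] args) {
        BufferedRetryDriver driver = new BufferedRetryDriver(BufferedRetryDriver::attemptFox);
        for (String chunk : new String[] {"quick", "b", "rown", "f", "ox", "\n"}) {
            System.out.println(driver.feed(chunk));
        }
    }
}
```

Fed the chunks from the question, the first five attempts end in `ERROR_AT_EOF` and the buffer is kept; the final newline completes the sentence and yields `PARSED`. The cost of this design is that the pending text is re-parsed on every arrival, which is the price of never handing the parser a blocking stream.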
