Using Antlr for parsing data from never-ending stream


Problem description


Is Antlr suitable for parsing data from streams that don't have an EOF right after the text to parse? From my observation, the lexer does not emit the current token until the first character of the next token is received. On top of that, the parser seems not to emit a rule until the first token of the next rule is received. Here is a simple grammar I tried:

fox: 'quick' 'brown' 'fox' '\r'? '\n' ;

Then I used the generated parser with UnbufferedCharStream and UnbufferedTokenStream:

  import org.antlr.v4.runtime.*;

  // Unbuffered streams, so characters and tokens are pulled on demand
  // instead of the whole input being read up front.
  CharStream input = new UnbufferedCharStream(is);
  MyLexer lex = new MyLexer(input);
  lex.setTokenFactory(new CommonTokenFactory(true)); // copy text out of the rolling buffer
  TokenStream tokens = new UnbufferedTokenStream(lex);
  MyParser parser = new MyParser(tokens);
  MyParser.FoxContext fox = parser.fox();

When the stream gets 'quick', nothing happens.

When 'b' comes in: entering rule 'fox'.

Then 'roun': nothing (2 tokens are in the stream; none of them is known to the lexer yet!).

Only after 'f' does the listener visit the first token: 'quick'.

Then nothing on 'ox'.

On a new line (unix): visit token 'brown'.

Now the stream has all the data (4 tokens), but only 2 tokens are recognized.

I found that in order to push those tokens through the system I have to feed the stream 2 more tokens, that is, any tokens known to the grammar. They could be 2 extra newlines, or, say, 'fox' and 'brown'. Only then do the tokens 'fox' and '\n' get visited, the parser exits rule 'fox', and parsing finishes.

Is that a bug or a feature? Is there a way to eliminate that lag?

Thanks!

Solution

The ANTLR 4 book was originally going to contain an example of parsing a streaming input, but I argued against it due to the severe complications that would inevitably arise from using an adaptive, unlimited-lookahead parser for something like this.

ANTLR 4 has no guaranteed lookahead bound (and no way to tell it to look for or even attempt to enforce one), so any implementation that operates on a blocking stream has the possibility of deadlock without returning information about the parse leading up to that point. I wouldn't even entertain the possibility of parsing a streaming input unless I saw an intermediate buffer in place first. With such a buffer, the approach would be:

  1. Take all available (or previously unparsed) input and place it in a String or char[].
  2. Create an ANTLRInputStream for the buffer.
  3. Attempt to lex/parse this stream, which will have an implicit EOF on the end.
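Step 1 above can be sketched with plain `java.io` (a hypothetical helper, no ANTLR dependency; the returned chunk would then be wrapped in an `ANTLRInputStream` per steps 2 and 3):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for step 1: drain whatever is currently available
// from the stream into a String, without blocking for data that has not
// arrived yet.
class ChunkReader {
    static String drainAvailable(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            while (in.available() > 0) {
                byte[] buf = new byte[in.available()];
                int n = in.read(buf);
                if (n <= 0) {
                    break; // end of stream
                }
                sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            // treat an I/O failure as "no more data available right now"
        }
        return sb.toString();
    }
}
```

Note that draining byte-wise can split a multi-byte UTF-8 character at a chunk boundary; a real implementation would decode incrementally (e.g. with a `CharsetDecoder`) or carry trailing bytes over to the next chunk.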

The result of the parse will tell you whether to discard the results to that point, or hold on to them to retry when more data is available:

  • If no syntax error occurs, the input was successfully parsed, and you can parse the next section of input when it becomes available later.

  • If a syntax error is reported before the EOF token is consumed, then a syntax error appears in the actual input, so you'll want to handle it (report it to the user, etc.).

  • If a syntax error is reported at the point where the EOF token is consumed then additional input may resolve the problem - ignore the results of the current parse, and then retry once more data is available from the input stream.
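The three outcomes above amount to a small decision rule. A self-contained sketch (the enum, class, and flag names are hypothetical; in real code the `errorAtEof` flag would be set by an ANTLR error listener that checks whether the offending token's type is `Token.EOF`):

```java
// The outcome of one buffered parse attempt maps onto one of three actions.
enum Action { ACCEPT, REPORT_ERROR, WAIT_FOR_MORE_INPUT }

// Hypothetical policy: decide what to do with the current buffer based on
// whether a syntax error occurred and where it occurred.
class RetryPolicy {
    static Action decide(boolean syntaxError, boolean errorAtEof) {
        if (!syntaxError) {
            return Action.ACCEPT;              // parsed cleanly; move to the next chunk
        }
        if (errorAtEof) {
            return Action.WAIT_FOR_MORE_INPUT; // the buffer may simply be cut short
        }
        return Action.REPORT_ERROR;            // a genuine error before EOF
    }
}
```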
