Can the ANTLR4 Java parser handle very large files, or can it stream files?

Problem description

Is the java parser generated by ANTLR capable of streaming arbitrarily large files?

I tried constructing a lexer with an UnbufferedCharStream and passed it to the parser. I got an UnsupportedOperationException because of a call to size() on the UnbufferedCharStream, and the exception contained an explanation that you can't call size() on an UnbufferedCharStream.

    Lexer lexer = new Lexer(new UnbufferedCharStream(new CharArrayReader("".toCharArray())));
    CommonTokenStream stream = new CommonTokenStream(lexer);
    Parser parser = new Parser(stream);

I basically have a file I exported from Hadoop using Pig. It has a large number of rows separated by '\n', and each column is split by a '\t'. This is easy to parse in Java: I use a buffered reader to read each line, then split on '\t' to get each column. But I also want to have some sort of schema validation. The first column should be a properly formatted date, followed by some price columns, followed by some hex columns.
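The plain-Java approach described above can be sketched as follows. The exact column layout (one date, two price columns, two hex columns) and the RowValidator name are assumptions for illustration, not taken from the original question:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Pattern;

public class RowValidator {
    // Hypothetical schema: date, two prices, two hex fields.
    static final Pattern DATE  = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
    static final Pattern PRICE = Pattern.compile("\\d+\\.\\d{2}");
    static final Pattern HEX   = Pattern.compile("[0-9a-fA-F]+");

    static boolean validLine(String line) {
        // -1 keeps trailing empty columns so the count check is honest.
        String[] cols = line.split("\t", -1);
        if (cols.length != 5) return false;
        return DATE.matcher(cols[0]).matches()
            && PRICE.matcher(cols[1]).matches()
            && PRICE.matcher(cols[2]).matches()
            && HEX.matcher(cols[3]).matches()
            && HEX.matcher(cols[4]).matches();
    }

    public static void main(String[] args) throws Exception {
        // A StringReader stands in for the real file reader here.
        BufferedReader r = new BufferedReader(new StringReader(
            "2024-01-15\t9.99\t10.50\tdeadbeef\tCAFE\n" +
            "not-a-date\t9.99\t10.50\tzz\tCAFE\n"));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(validLine(line));
        }
    }
}
```

This streams naturally (one line in memory at a time), but the "schema" lives in ad-hoc regexes rather than a grammar, which is what motivates the ANTLR question.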

When I look at the generated parser code, I could call it like so:

    parser.lines().line()

This would give me a List which, conceptually, I could iterate over. But it seems the list would have a fixed size by the time I get it, which means the parser has probably already parsed the entire file.

Is there another part of the API that would allow you to stream really large files? Some way of using a Visitor or Listener that gets called as the file is being read? It can't keep the entire file in memory; it will not fit.

Accepted answer

You can do it like this:

InputStream is = new FileInputStream(inputFile); // inputFile is the path to your input file
ANTLRInputStream input = new ANTLRInputStream(is);
GeneratedLexer lex = new GeneratedLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
GeneratedParser parser = new GeneratedParser(tokens);
parser.setBuildParseTree(false); // !! essential for large files
parser.top_level_rule();

And if the file is quite big, forget about the listener or visitor - I would create objects directly in the grammar. Just put them all in some structure (e.g. a HashMap, a Vector, ...) and retrieve them as needed. This way, creating the parse tree (and this is what really takes a lot of memory) is avoided.
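A minimal sketch of what "creating objects directly in the grammar" can look like, using embedded actions. The token names (DATE, PRICE, HEX, NEWLINE) and the Row class are hypothetical, not from the original question:

```antlr
// Hypothetical grammar fragment with embedded Java actions.
@parser::members {
    java.util.List<Row> rows = new java.util.ArrayList<>();
}

line
    : d=DATE '\t' p=PRICE '\t' h=HEX NEWLINE
      // The action runs as each line is matched; no parse tree is retained.
      { rows.add(new Row($d.text, $p.text, $h.text)); }
    ;
```

Since setBuildParseTree(false) is in effect, the embedded action is the only place the matched text is still available, so anything you need must be captured there as parsing proceeds.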
