Read local Parquet file without Hadoop Path API


Problem Description


I'm trying to read a local Parquet file; however, the only APIs I can find are tightly coupled with Hadoop, and require a Hadoop Path as input (even for pointing to a local file).

ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
GenericRecord nextRecord = reader.read();

is the most popular answer in "how to read a parquet file, in a standalone java code?", but requires a Hadoop Path and has now been deprecated for a mysterious InputFile instead. The only implementation of InputFile I can find is HadoopInputFile, so again no help.

In Avro this is as simple as:

DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
this.dataFileReader = new DataFileReader<>(file, datumReader);

(where file is java.io.File). What's the Parquet equivalent?

I am asking for no Hadoop Path dependency in the answers, because Hadoop drags in bloat and jar hell, and it seems silly to require it for reading local files.

To further explain the backstory, I maintain a small IntelliJ plugin that allows users to drag-and-drop Avro files into a pane for viewing in a table. This plugin is currently 5MB. If I include Parquet and Hadoop dependencies, it bloats to over 50MB, and doesn't even work.


POST-ANSWER ADDENDUM

Now that I have it working (thanks to the accepted answer), here is my working solution that avoids all the annoying errors that can be dragged in by depending heavily on the Hadoop Path API:

Solution

Unfortunately the java parquet implementation is not independent of some hadoop libraries. There is an existing issue in their bugtracker to make it easy to read and write parquet files in java without depending on hadoop, but there does not seem to be much progress on it. The InputFile interface was added to add a bit of decoupling, but a lot of the classes that implement the metadata part of parquet, and also all compression codecs, live inside the hadoop dependency.
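As a rough sketch (assuming a parquet-avro version whose AvroParquetReader.builder accepts an InputFile, roughly 1.11+, and DelegatingSeekableInputStream from parquet-common), an InputFile backed by a plain java.nio path could look something like this; the class names are made up for the example:

// Illustrative sketch: a minimal org.apache.parquet.io.InputFile over a local file,
// so AvroParquetReader can be pointed at it without constructing a Hadoop Path.
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class NioInputFile implements InputFile {
    private final Path path;

    NioInputFile(Path path) { this.path = path; }

    @Override
    public long getLength() throws IOException {
        return Files.size(path);
    }

    @Override
    public SeekableInputStream newStream() throws IOException {
        FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
        // DelegatingSeekableInputStream supplies the read methods; only position tracking is needed here.
        return new DelegatingSeekableInputStream(Channels.newInputStream(channel)) {
            @Override
            public long getPos() throws IOException { return channel.position(); }

            @Override
            public void seek(long newPos) throws IOException { channel.position(newPos); }
        };
    }
}

// Usage: print every record of a local parquet file.
class ReadLocalParquet {
    public static void main(String[] args) throws IOException {
        InputFile in = new NioInputFile(Paths.get(args[0]));
        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(in).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}

Note that parquet-hadoop and hadoop-common still have to be on the classpath (that is where the metadata and codec classes mentioned above live); this only avoids going through the Hadoop Path/FileSystem API.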

I found another implementation of InputFile in the smile library; this might be more efficient than going through the hadoop filesystem abstraction, but it does not solve the dependency problem.

As other answers already mention, you can create a hadoop Path for a local file and use that without problems.

java.io.File file = ...
new org.apache.hadoop.fs.Path(file.toURI())
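
For example (a sketch reusing the file variable above; the Path-based builder overload is the deprecated one mentioned in the question, but it still works for a local file):

org.apache.hadoop.fs.Path hadoopPath = new org.apache.hadoop.fs.Path(file.toURI());
ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(hadoopPath).build();
GenericRecord record = reader.read();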

The dependency tree that is pulled in by hadoop can be reduced a lot by defining some exclusions. I'm using the following to reduce the bloat (using gradle syntax):

compile("org.apache.hadoop:hadoop-common:3.1.0") {
    exclude(group: 'org.slf4j')
    exclude(group: 'org.mortbay.jetty')
    exclude(group: 'javax.servlet.jsp')
    exclude(group: 'com.sun.jersey')
    exclude(group: 'log4j')
    exclude(group: 'org.apache.curator')
    exclude(group: 'org.apache.zookeeper')
    exclude(group: 'org.apache.kerby')
    exclude(group: 'com.google.protobuf')
}
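
Running gradle dependencies (or ./gradlew dependencies) afterwards prints the resolved tree, which makes it easy to check whether the exclusions actually removed the heavy transitive jars.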
