S3 Implementation for org.apache.parquet.io.InputFile?

Problem Description

I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.

I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be to pass it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.

Any ideas on how to solve this puzzle?

Just in case anyone is interested... why am I doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct-based schemas.

Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema. But maybe I am missing something there.

I'd really like to get the ParquetFileReader class to work, as it seems clean.

Any ideas are appreciated.

Recommended Answer

Hadoop uses its own filesystem abstraction layer, which has an implementation for S3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).

The setup should look something like the following (Java, but the same should work in Scala):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Constants;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;

// Configure the s3a filesystem: endpoint and credential provider
Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
    DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider

// Use the s3a:// scheme so Hadoop routes the path through the S3A filesystem
URI uri = URI.create("s3a://bucketname/path");
Path path = new Path(uri);

ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));
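Since the question asks about Scala, the same setup might be sketched as follows. This is a sketch only, assuming hadoop-aws and parquet-hadoop are on the classpath; the bucket name, endpoint, and the idea of printing the file schema are placeholders, not part of the original answer. The configuration keys are set by their string names ("fs.s3a.endpoint", "fs.s3a.aws.credentials.provider") to avoid the Java-centric Constants class:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object S3ParquetSketch {
  def main(args: Array[String]): Unit = {
    // Same s3a setup as the Java version, using the string property keys
    val conf = new Configuration()
    conf.set("fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com/") // placeholder region
    conf.set("fs.s3a.aws.credentials.provider",
      "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

    // Placeholder bucket/key; s3a:// routes through the Hadoop S3A filesystem
    val path = new Path(URI.create("s3a://bucketname/path"))

    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))
    try {
      // e.g. inspect the schema from the footer before deciding how to
      // write the records back out as partitioned JSON
      val schema = reader.getFooter.getFileMetaData.getSchema
      println(schema)
    } finally {
      reader.close()
    }
  }
}
```

Note the try/finally: ParquetFileReader holds an open stream against S3, so it should be closed once the footer and row groups have been read.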
