Read Parquet data from an AWS S3 bucket

Views: 119 | Published: 2020/8/23 4:31:52 | Tags: java amazon-web-services amazon-s3 parquet

Question

I need to read Parquet data from AWS S3. If I use the AWS SDK for this, I can get an InputStream like this:

S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();

But the Apache Parquet reader only works with a local file, like this:

ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
                .withConf(conf)
                .build();
reader.read();

So I don't know how to parse an input stream as a Parquet file. For CSV files, for example, there is CSVParser, which accepts an InputStream.

I know of a solution that uses Spark for this, like this:

SparkSession spark = SparkSession
                .builder()
                .getOrCreate();
Dataset<Row> ds = spark.read().parquet("s3a://bucketName/file.parquet");

But I cannot use Spark.

Could anyone suggest a way to read Parquet data from S3?

Recommended answer

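import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

// Avro schema naming the columns to read from the Parquet file.
String SCHEMA_TEMPLATE = "{" +
                        "\"type\": \"record\",\n" +
                        "    \"name\": \"schema\",\n" +
                        "    \"fields\": [\n" +
                        "        {\"name\": \"timeStamp\", \"type\": \"string\"},\n" +
                        "        {\"name\": \"temperature\", \"type\": \"double\"},\n" +
                        "        {\"name\": \"pressure\", \"type\": \"double\"}\n" +
                        "    ]" +
                        "}";
// URI scheme of the Hadoop S3A connector (from hadoop-aws).
String PATH_SCHEMA = "s3a";
// Path(scheme, authority, path): bucketName and folderName come from the caller;
// folderName is expected to be an absolute object path such as "/dir/file.parquet".
Path internalPath = new Path(PATH_SCHEMA, bucketName, folderName);
Schema schema = new Schema.Parser().parse(SCHEMA_TEMPLATE);
Configuration configuration = new Configuration();
// Read only the fields listed in the schema.
AvroReadSupport.setRequestedProjection(configuration, schema);

try (ParquetReader<GenericRecord> parquetReader =
        AvroParquetReader.<GenericRecord>builder(internalPath).withConf(configuration).build()) {
    GenericRecord genericRecord;
    // read() returns null once the file is exhausted.
    while ((genericRecord = parquetReader.read()) != null) {
        Map<String, String> valuesMap = new HashMap<>();
        for (Schema.Field field : genericRecord.getSchema().getFields()) {
            // String.valueOf tolerates null column values.
            valuesMap.put(field.name(), String.valueOf(genericRecord.get(field.name())));
        }
    }
}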

Gradle dependencies

    compile 'com.amazonaws:aws-java-sdk:1.11.213'
    compile 'org.apache.parquet:parquet-avro:1.9.0'
    compile 'org.apache.parquet:parquet-hadoop:1.9.0'
    compile 'org.apache.hadoop:hadoop-common:2.8.1'
    compile 'org.apache.hadoop:hadoop-aws:2.8.1'
    compile 'org.apache.hadoop:hadoop-client:2.8.1'
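
If the s3a connector is not an option, one possible fallback (not part of the answer above) is to spool the InputStream obtained from the AWS SDK into a temporary file and hand that file to the local-file reader shown in the question. The sketch below uses only the JDK; the class name S3StreamSpooler and the method spoolToTempFile are hypothetical names chosen for illustration, and the ByteArrayInputStream in main merely stands in for object.getObjectContent(). Note that java.nio.file.Path is a different type from org.apache.hadoop.fs.Path.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies a stream (e.g. an S3 object body) to a local temp file.
class S3StreamSpooler {

    /**
     * Writes the whole stream to a new temporary file and closes the stream.
     * The caller is responsible for deleting the returned file when done.
     */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("s3-object-", ".parquet");
        try (InputStream source = in) {
            // createTempFile already created the file, so overwrite it.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: s3Client.getObject(...).getObjectContent()
        InputStream fakeS3Stream =
                new ByteArrayInputStream("PAR1".getBytes(StandardCharsets.US_ASCII));
        Path local = spoolToTempFile(fakeS3Stream);
        System.out.println("Spooled " + Files.size(local) + " bytes to " + local);
        Files.delete(local);
    }
}
```

The resulting file could then be passed to the reader from the question, for example via new org.apache.hadoop.fs.Path(local.toString()). The trade-off is that the whole object is downloaded to disk before any record is read, whereas the s3a approach reads directly from the bucket.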
