Read Parquet data from an AWS S3 bucket
Question
I need to read Parquet data from AWS S3. With the AWS SDK I can get an InputStream like this:
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();
But the Apache Parquet reader only accepts a local file, like this:
ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
                .withConf(conf)
                .build();
reader.read();
So I don't know how to parse an input stream for a Parquet file. For CSV files, for example, there is CSVParser, which accepts an input stream.
I know there is a solution that uses Spark, like this:
SparkSession spark = SparkSession
.builder()
.getOrCreate();
Dataset<Row> ds = spark.read().parquet("s3a://bucketName/file.parquet");
But I can't use Spark.
Could anyone suggest a way to read Parquet data from S3?
Recommended answer
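// Imports this snippet relies on (provided by the dependencies listed below).
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;
import java.util.HashMap;
import java.util.Map;

// Avro schema naming the columns to project out of the Parquet file.
String SCHEMA_TEMPLATE = "{" +
        "\"type\": \"record\",\n" +
        "    \"name\": \"schema\",\n" +
        "    \"fields\": [\n" +
        "        {\"name\": \"timeStamp\", \"type\": \"string\"},\n" +
        "        {\"name\": \"temperature\", \"type\": \"double\"},\n" +
        "        {\"name\": \"pressure\", \"type\": \"double\"}\n" +
        "    ]" +
        "}";
String PATH_SCHEMA = "s3a";
Path internalPath = new Path(PATH_SCHEMA, bucketName, folderName);
Schema schema = new Schema.Parser().parse(SCHEMA_TEMPLATE);
// S3A credentials are typically resolved from this configuration
// (for example fs.s3a.access.key and fs.s3a.secret.key) or from the environment.
Configuration configuration = new Configuration();
AvroReadSupport.setRequestedProjection(configuration, schema);
try (ParquetReader<GenericRecord> parquetReader =
        AvroParquetReader.<GenericRecord>builder(internalPath).withConf(configuration).build()) {
    GenericRecord genericRecord = parquetReader.read();
    while (genericRecord != null) {
        Map<String, String> valuesMap = new HashMap<>();
        // A plain loop is used because genericRecord is reassigned below
        // and therefore cannot be captured by a lambda.
        for (Schema.Field field : genericRecord.getSchema().getFields()) {
            valuesMap.put(field.name(), String.valueOf(genericRecord.get(field.name())));
        }
        // valuesMap now holds one row; process it here.
        genericRecord = parquetReader.read();
    }
}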
Gradle dependencies
compile 'com.amazonaws:aws-java-sdk:1.11.213'
compile 'org.apache.parquet:parquet-avro:1.9.0'
compile 'org.apache.parquet:parquet-hadoop:1.9.0'
compile 'org.apache.hadoop:hadoop-common:2.8.1'
compile 'org.apache.hadoop:hadoop-aws:2.8.1'
compile 'org.apache.hadoop:hadoop-client:2.8.1'
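If going through S3A is not possible, one fallback for the situation in the question (an InputStream from the AWS SDK, and a reader that wants a file) is to copy the stream to a temporary file first. Parquet keeps its metadata in a footer at the end of the file, so a reader needs random access rather than a forward-only stream. The sketch below uses only the JDK. StreamSpooler and spoolToTempFile are hypothetical names, and the in-memory bytes in main merely stand in for object.getObjectContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of the AWS SDK, Hadoop, or Parquet. */
final class StreamSpooler {

    private StreamSpooler() {
    }

    /**
     * Copies the whole stream into a new temporary file and returns its path.
     * The stream is closed afterwards; the caller deletes the file when done.
     */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tempFile = Files.createTempFile("s3-object-", ".parquet");
        try (InputStream source = in) {
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tempFile);
            throw e;
        }
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for object.getObjectContent(); a real call needs an S3 client.
        InputStream fake = new ByteArrayInputStream("PAR1".getBytes(StandardCharsets.US_ASCII));
        Path local = spoolToTempFile(fake);
        try {
            System.out.println("Spooled " + Files.size(local) + " bytes to " + local);
        } finally {
            Files.deleteIfExists(local);
        }
    }
}
```

The returned path can then be handed to the local-file reader from the question, for instance by wrapping local.toString() in a Hadoop Path. Note that java.nio.file.Path here is a different class from org.apache.hadoop.fs.Path. The trade-off is local disk space and an extra copy for every object.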