Convert JSON to Parquet


Question

I have a few TB of log data in JSON format, and I want to convert it to Parquet format to get better performance in the analytics stage.

I do the conversion by writing code that uses parquet-mr and parquet-avro.

The only thing I'm not satisfied with is that my JSON logs don't have a fixed schema: I don't know all the fields' names and types. Besides, even if I knew all the fields' names and types, my schema would evolve over time; for example, new fields will be added in the future.

For now I have to provide an Avro schema to AvroWriteSupport, and Avro only allows a fixed set of fields.

Is there a better way to store arbitrary fields in Parquet, just like JSON?

Answer

One thing for sure is that Parquet needs an Avro schema in advance, so we'll focus on how to get that schema.

1. Use SparkSQL to convert JSON files to Parquet files.

SparkSQL can infer a schema automatically from the data, so we don't need to provide one ourselves. Every time the data changes, SparkSQL will infer a correspondingly different schema.
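The SparkSQL route can be sketched as below. This is a minimal sketch assuming a PySpark environment; the application name and the HDFS paths are placeholders, not anything from the original answer.

```python
# Minimal sketch: let SparkSQL infer the schema from the JSON logs,
# then write the data back out as Parquet. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Spark scans the JSON files and infers the schema automatically.
df = spark.read.json("hdfs:///logs/*.json")
df.printSchema()  # the inferred schema; it may differ when the data changes

# Write Parquet using the inferred schema.
df.write.mode("overwrite").parquet("hdfs:///logs-parquet/")
spark.stop()
```

Note that because inference runs per job, two batches of logs with different fields will produce Parquet files with different schemas, which is exactly the behavior described above.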

2. Maintain an Avro schema manually.

If you don't use Spark but only Hadoop, you need to infer the schema yourself. First, write a MapReduce job that scans all the JSON files and collects every field; once you know all the fields, you can write an Avro schema. Then use this schema to convert the JSON files to Parquet files.
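The scan-then-build step can be sketched in plain Python. This is a hypothetical illustration, not the original poster's job: the `scan_fields`/`build_avro_schema` names and the Python-to-Avro type mapping are assumptions, and a real MapReduce job would distribute the scan rather than loop in one process.

```python
import json

# Simplified assumption: map Python value types to Avro primitive types.
PY_TO_AVRO = {str: "string", bool: "boolean", int: "long", float: "double"}

def scan_fields(records):
    """Collect the union of all top-level fields seen across records,
    keeping the first type guessed for each field."""
    fields = {}
    for rec in records:
        for name, value in rec.items():
            fields.setdefault(name, PY_TO_AVRO.get(type(value), "string"))
    return fields

def build_avro_schema(fields, name="LogRecord"):
    """Build an Avro record schema. Every field is a nullable union with
    a null default, so records missing the field still serialize."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": n, "type": ["null", t], "default": None}
            for n, t in sorted(fields.items())
        ],
    }

logs = [
    json.loads('{"ts": 1700000000, "msg": "start"}'),
    json.loads('{"ts": 1700000001, "msg": "stop", "code": 0}'),
]
schema = build_avro_schema(scan_fields(logs))
print([f["name"] for f in schema["fields"]])  # → ['code', 'msg', 'ts']
```

Making every field nullable is the key trick: it lets one schema cover records that only contain a subset of the fields.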

New, unknown fields will keep appearing in the future; every time new fields show up, add them to the Avro schema. So basically, we'd be doing SparkSQL's job manually.
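That manual evolution step can be sketched as follows. The `evolve_schema` helper is hypothetical (not from the original answer); it follows the usual Avro schema-evolution convention of appending new fields as nullable unions with a null default, so Parquet files written with the old schema remain readable.

```python
# Hypothetical helper: append newly discovered fields to an existing
# Avro record schema as nullable fields with a null default.

def evolve_schema(schema, new_fields):
    """Return a copy of `schema` with unseen fields appended; the input
    schema is left unmodified."""
    known = {f["name"] for f in schema["fields"]}
    evolved = dict(schema)
    evolved["fields"] = list(schema["fields"])
    for name, avro_type in sorted(new_fields.items()):
        if name not in known:
            evolved["fields"].append(
                {"name": name, "type": ["null", avro_type], "default": None}
            )
    return evolved

base = {
    "type": "record",
    "name": "LogRecord",
    "fields": [{"name": "ts", "type": ["null", "long"], "default": None}],
}
# A new field "user_id" shows up in the logs:
updated = evolve_schema(base, {"user_id": "string"})
print([f["name"] for f in updated["fields"]])  # → ['ts', 'user_id']
```

Only appending nullable fields with defaults keeps old and new schemas mutually compatible, which is what makes this manual process workable at all.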
