Formatting JSON files for SQLContext

Question

I'm experiencing issues when loading JSON that depend on the formatting of the input JSON file.

According to the Spark documentation on JSON Datasets, each line of the input file must be a valid JSON object:

"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."

So, if I have an input JSON file such as:

{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
},
{
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}

Are there any existing tools or scripts to convert it to:

{"Year": "2013","First Name": "DAVID","County": "KINGS","Sex": "M","Count":"272"},
{"Year": "2013","First Name": "JAYDEN","County": "KINGS","Sex": "M","Count": "268"}

where the JSON conforms to "Each line must contain a separate, self-contained valid JSON object"?

If I format the file in the style above, things work as expected. But I made these modifications manually over a few rows. I cannot do this for the entire data set, so I'm looking for an existing script or tool.

I could load the data into a JDBC-accessible database if that's a better option. Thoughts?

Thanks in advance.

Recommended answer

You can load the JSON files into an RDD first using sc.wholeTextFiles(), drop the file name from each (file name, content) pair, and then run the SQLContext JSON reader on the RDD of file contents.

For example:

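// wholeTextFiles yields (file name, file content) pairs; keep only the content.
val jsonRdd = sc.wholeTextFiles("samplefile.json").map(x => x._2)
// Each RDD element is now the full text of one file, handed to the JSON reader as a unit.
val jsonDf = sqlContext.read.json(jsonRdd)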
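
If the goal is literally the one-object-per-line layout the question asks for, a preprocessing step could look something like the sketch below. It is not part of the original answer and not a Spark API: the object name JsonLines and the method toJsonLines are hypothetical, and the code assumes the input is a run of top-level JSON objects, optionally separated by commas or wrapped in a single outer array. It tracks brace depth and string state by hand and does not validate the JSON.

```scala
object JsonLines {
  /** Splits text holding several top-level JSON objects into one compact,
    * single-line string per object. Hypothetical helper for illustration;
    * it only tracks nesting and string literals, it does not validate JSON. */
  def toJsonLines(text: String): Vector[String] = {
    val out = Vector.newBuilder[String]
    val cur = new StringBuilder
    var depth = 0          // nesting depth inside the current top-level object
    var inString = false   // currently inside a "..." literal
    var escaped = false    // previous character was a backslash inside a string

    for (c <- text) {
      if (inString) {
        // Copy string contents verbatim, including spaces and braces.
        cur.append(c)
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else if (depth == 0) {
        // Between objects: skip commas, whitespace, and any enclosing [ ].
        if (c == '{') { depth = 1; cur.append(c) }
      } else c match {
        case '"' =>
          inString = true
          cur.append(c)
        case '{' | '[' =>
          depth += 1
          cur.append(c)
        case '}' | ']' =>
          depth -= 1
          cur.append(c)
          if (depth == 0) { out += cur.toString; cur.clear() }
        case w if w.isWhitespace => // drop layout whitespace outside strings
        case other => cur.append(other)
      }
    }
    out.result()
  }
}
```

Writing the returned strings to a file, one per line, gives the line-delimited layout described in the Spark documentation quoted in the question. Inside a Spark shell, the same function could presumably be applied to the contents returned by sc.wholeTextFiles() with flatMap before calling sqlContext.read.json, so that each object becomes its own record; that combination is untested here.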
