pyspark: Save schemaRDD as json file


Problem description

I am looking for a way to export data from Apache Spark to various other tools in JSON format. I presume there must be a really straightforward way to do it.

Example: I have the following JSON file 'jfile.json':

{"key":value_a1, "key2":value_b1},
{"key":value_a2, "key2":value_b2},
{...}

where each line of the file is a JSON object. These kinds of files can easily be read into PySpark with

jsonRDD = sqlContext.jsonFile('jfile.json')

and then looks like this (by calling jsonRDD.collect()):

[Row(key=value_a1, key2=value_b1),Row(key=value_a2, key2=value_b2)]
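
For reference, the full read side is just the following (a minimal sketch, assuming the Spark 1.x API, where jsonFile is a method of SQLContext):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Minimal setup for the example above; jsonFile expects one JSON object
# per line, as in 'jfile.json'.
sc = SparkContext(appName='json-export')
sqlContext = SQLContext(sc)
jsonRDD = sqlContext.jsonFile('jfile.json')
print(jsonRDD.collect())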

Now I want to save these kinds of files back to a pure JSON file.

I found this entry on the Spark User list:

http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-td12211.html

which suggests using

jsonRDD.saveAsTextFile('jfile_out')

After doing this, the text file looks like

Row(key=value_a1, key2=value_b1)
Row(key=value_a2, key2=value_b2)

i.e., the jsonRDD has just been written to the file as plain text. After reading the Spark User List entry, I would have expected some kind of "automagic" conversion back to JSON format. My goal is to have a file that looks like the 'jfile.json' mentioned at the beginning.

Am I missing a really obvious easy way to do this?

I read http://spark.apache.org/docs/latest/programming-guide.html, searched Google, the user list, and Stack Overflow for answers, but almost all of them deal with reading and parsing JSON into Spark. I even bought the book 'Learning Spark', but the examples there (p. 71) just lead to the same output file as above.

Can anybody help me out here? I feel like I'm just missing a small link.

Cheers and thanks in advance!

Recommended answer

I can't see an easy way to do it. One solution is to convert each element of the SchemaRDD to a String, ending up with an RDD[String] where each element is the formatted JSON for that row. So you need to write your own JSON serializer. That's the easy part. It may not be super fast, but it should work in parallel, and you already know how to save an RDD to a text file.
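
In pyspark, that pipeline might look like the following sketch (assuming a version where Row.asDict() is available; 'out_path' is a hypothetical output directory):

import json

# Serialize each Row to a JSON string, then save the resulting
# RDD[String] as a text file. This simple version assumes a flat
# schema; nested rows need the recursive treatment described below.
stringRDD = jsonRDD.map(lambda row: json.dumps(row.asDict()))
stringRDD.saveAsTextFile('out_path')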

The key insight is that you can get a representation of the schema out of the SchemaRDD by calling the schema method. Then each Row handed to you by map needs to be traversed recursively in conjunction with the schema. This is actually an in-tandem list traversal for flat JSON, but you may also need to consider nested JSON.
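
A pyspark sketch of that schema-guided traversal could look like this (the function name and output path are my own; on Spark 1.x the type classes live in pyspark.sql rather than pyspark.sql.types):

import json
from pyspark.sql.types import StructType, ArrayType

def to_plain(value, data_type):
    # Recursively convert a Row plus its schema into plain dicts/lists
    # that json.dumps can handle; None covers missing values.
    if value is None:
        return None
    if isinstance(data_type, StructType):
        # In-tandem walk: pair each field of the Row (a tuple subclass)
        # with its StructField from the schema.
        return dict((f.name, to_plain(v, f.dataType))
                    for f, v in zip(data_type.fields, value))
    if isinstance(data_type, ArrayType):
        return [to_plain(v, data_type.elementType) for v in value]
    return value  # primitive leaf

schema = jsonRDD.schema()  # the schema method mentioned above
jsonRDD.map(lambda row: json.dumps(to_plain(row, schema))).saveAsTextFile('out_path')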

The rest is just a small matter of Python, which I don't speak, but I do have this working in Scala in case it helps you. The parts where the Scala code gets dense don't actually depend on deep Spark knowledge, so if you understand basic recursion and know Python, you should be able to make it work. The bulk of the work for you is figuring out how to work with a pyspark.sql.Row and a pyspark.sql.StructType in the Python API.

One word of caution: I'm pretty sure my code doesn't yet work in the case of missing values -- the formatItem method needs to handle null elements.

EDIT: In Spark 1.2.0 the toJSON method was introduced to SchemaRDD, making this a much simpler problem -- see the answer by @jegordon.
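
With toJSON, the whole export collapses to a one-liner (sketch; 'out_path' is again a hypothetical output directory):

# Spark 1.2+: toJSON yields an RDD of JSON strings, one per row.
jsonRDD.toJSON().saveAsTextFile('out_path')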

