How to load directory of JSON files into Apache Spark in Python


Question


I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation (https://spark.apache.org/docs/1.2.0/api/python/pyspark.html), but if I missed it please let me know.

So far I have tried reading the JSON files and building the combined list in Python, then calling sc.parallelize(); however, the entire dataset is too large to fit in memory, so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.
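For reference, the in-memory approach just described looks roughly like the following (the directory path and file naming are hypothetical); it fails as soon as the combined list no longer fits on the driver:

import glob
import gzip
import json

# Read every gzipped JSON file into one big Python list on the driver.
# This is the approach that does NOT scale: the whole dataset must fit in memory.
all_dicts = []
for path in glob.glob("/data/json_dir/*.json.gz"):  # hypothetical directory
    with gzip.open(path, "rt") as f:
        all_dicts.extend(json.load(f))  # each file holds one list of dictionaries

my_RDD = sc.parallelize(all_dicts)  # sc is an existing SparkContext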

How can I create a single RDD in Python comprising the lists in all of the JSON files?

I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.
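To make that concrete, the kind of processing I'm after is plain RDD transformations, for example (the field names here are just placeholders):

# Keep the records marked active and pull out one field, with no Spark SQL involved.
active_names = my_RDD.filter(lambda d: d.get("active")).map(lambda d: d["name"])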

Solution

Following what tgpfeiffer mentioned in their answer and comment, here's what I did.

First, as they mentioned, the JSON files had to be reformatted so they had one dictionary per line rather than a single list of dictionaries (a sketch of that reformatting step follows the code below). Then, it was as simple as:

import json

# sc is an existing SparkContext; textFile reads each line of every file in the directory
my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
my_RDD_dictionaries = my_RDD_strings.map(json.loads)
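As for the reformatting step itself, a one-off script along these lines is what I mean; the paths and output naming are hypothetical:

import glob
import gzip
import json

# Rewrite each gzipped JSON list as newline-delimited JSON, one dictionary per line.
for path in glob.glob("/data/json_dir/*.json.gz"):    # hypothetical input directory
    with gzip.open(path, "rt") as f:
        records = json.load(f)                        # a list of dictionaries
    with gzip.open(path + ".ndjson.gz", "wt") as out: # hypothetical output name
        for record in records:
            out.write(json.dumps(record) + "\n")

Conveniently, sc.textFile decompresses .gz input transparently, so the reformatted files can stay gzipped; just note that each gzipped file is read as a single, unsplittable partition.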

If there's a better or more efficient way to do this, please let me know, but this seems to work.
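One alternative that avoids reformatting the files entirely is sc.wholeTextFiles, which returns (filename, contents) pairs, so each file's list can be parsed and flattened in one step. Treat this as an untested sketch: each whole file must fit in a single record, and I have not verified how this API handles gzipped input across Spark versions.

import json

pairs = sc.wholeTextFiles(path_to_dir_with_JSON_files)
# Each element is (filename, file_contents); json.loads turns the contents
# back into a list, and flatMap flattens those lists into one RDD of dictionaries.
my_RDD_dictionaries = pairs.flatMap(lambda kv: json.loads(kv[1]))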
