How to load directory of JSON files into Apache Spark in Python


Question


I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation (https://spark.apache.org/docs/1.2.0/api/python/pyspark.html), but if I missed it please let me know.

So far I have tried reading the JSON files and building the combined list in Python, then calling sc.parallelize(); however, the entire dataset is too large to fit in memory, so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.
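For reference, the in-memory approach just described looks roughly like the following (the directory path and file naming are hypothetical); it fails as soon as the combined list no longer fits on the driver:

import glob
import gzip
import json

# Read every gzipped JSON file into one big Python list on the driver.
# This is the approach that does NOT scale: the whole dataset must fit in memory.
all_dicts = []
for path in glob.glob("/data/json_dir/*.json.gz"):  # hypothetical directory
    with gzip.open(path, "rt") as f:
        all_dicts.extend(json.load(f))  # each file holds one list of dictionaries

my_RDD = sc.parallelize(all_dicts)  # sc is an existing SparkContext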

How can I create a single RDD in Python comprising the lists in all of the JSON files?

I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.
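To make that concrete, the kind of processing I'm after is plain RDD transformations, for example (the field names here are just placeholders):

# Keep the records marked active and pull out one field, with no Spark SQL involved.
active_names = my_RDD.filter(lambda d: d.get("active")).map(lambda d: d["name"])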

Solution

Following what tgpfeiffer mentioned in their answer and comment, here's what I did.

First, as they mentioned, the JSON files had to be reformatted so they had one dictionary per line rather than a single list of dictionaries (a sketch of that reformatting step follows the code below). Then, it was as simple as:

import json

# sc is an existing SparkContext; textFile reads each line of every file in the directory
my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
my_RDD_dictionaries = my_RDD_strings.map(json.loads)
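As for the reformatting step itself, a one-off script along these lines is what I mean; the paths and output naming are hypothetical:

import glob
import gzip
import json

# Rewrite each gzipped JSON list as newline-delimited JSON, one dictionary per line.
for path in glob.glob("/data/json_dir/*.json.gz"):    # hypothetical input directory
    with gzip.open(path, "rt") as f:
        records = json.load(f)                        # a list of dictionaries
    with gzip.open(path + ".ndjson.gz", "wt") as out: # hypothetical output name
        for record in records:
            out.write(json.dumps(record) + "\n")

Conveniently, sc.textFile decompresses .gz input transparently, so the reformatted files can stay gzipped; just note that each gzipped file is read as a single, unsplittable partition.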

If there's a better or more efficient way to do this, please let me know, but this seems to work.
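One alternative that avoids reformatting the files entirely is sc.wholeTextFiles, which returns (filename, contents) pairs, so each file's list can be parsed and flattened in one step. Treat this as an untested sketch: each whole file must fit in a single record, and I have not verified how this API handles gzipped input across Spark versions.

import json

pairs = sc.wholeTextFiles(path_to_dir_with_JSON_files)
# Each element is (filename, file_contents); json.loads turns the contents
# back into a list, and flatMap flattens those lists into one RDD of dictionaries.
my_RDD_dictionaries = pairs.flatMap(lambda kv: json.loads(kv[1]))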
