How to load directory of JSON files into Apache Spark in Python
Problem description
I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation (https://spark.apache.org/docs/1.2.0/api/python/pyspark.html), but if I missed it please let me know.
So far I tried reading the JSON files and creating the combined list in Python, then using sc.parallelize(); however, the entire dataset is too large to fit in memory, so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.
How can I create a single RDD in Python comprising the lists in all of the JSON files?
I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.
Following what tgpfeiffer mentioned in their answer and comment, here's what I did.
First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:
import json
my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)  # .gz files are decompressed automatically
my_RDD_dictionaries = my_RDD_strings.map(json.loads)
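As a quick local sanity check (no Spark needed), the same two-step pipeline can be mimicked with plain Python: each line of a correctly formatted input file is one JSON string, and json.loads turns it into a dict, which is exactly what the map above does per record. The field names below are made up for illustration:

```python
import json

# Three lines as they would appear in a correctly formatted input file:
lines = ['{"id": 1, "score": 0.5}',
         '{"id": 2, "score": 0.9}',
         '{"id": 3, "score": 0.1}']

# Local equivalent of my_RDD_strings.map(json.loads):
dicts = [json.loads(line) for line in lines]

# The result then supports the desired map/filter style, e.g. keeping
# records with high scores and projecting out one field:
high = [d["id"] for d in dicts if d["score"] > 0.4]
print(high)  # [1, 2]
```

On the real RDD the same lambdas would be passed to .filter() and .map() instead of used in a comprehension.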
If there's a better or more efficient way to do this, please let me know, but this seems to work.
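The reformatting step described above — turning a gzipped file holding one JSON list of dicts into a file with one JSON object per line — can be done with a short standalone script. A minimal sketch; the function name, file names, and the throwaway-directory demo are illustrative, not from the original post:

```python
import gzip
import json
import os
import tempfile

def json_list_to_jsonl(src_path, dst_path):
    """Rewrite a gzipped JSON file containing one list of dicts into a
    plain-text file with one JSON object per line (the layout that
    sc.textFile followed by map(json.loads) expects)."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        records = json.load(src)  # the whole list for this one file
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")

# Demonstration with a throwaway gzipped input file:
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "part-0000.json.gz")
dst = os.path.join(tmp, "part-0000.jsonl")
with gzip.open(src, "wt", encoding="utf-8") as f:
    json.dump([{"id": 1}, {"id": 2}], f)

json_list_to_jsonl(src, dst)
with open(dst, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)  # [{'id': 1}, {'id': 2}]
```

Note that this loads each file's list into memory one file at a time, which is fine as long as no single file is too large, even when the whole dataset is.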