Parsing JSON object with large number of unique keys (not a list of objects) using PySpark

Question
I'm currently dealing with the following source data in a JSON file:
{
  "unique_key_1": {
    "some_value_1": 1,
    "some_value_2": 2
  },
  "unique_key_2": {
    "some_value_1": 2,
    "some_value_2": 3
  },
  "unique_key_3": {
    "some_value_1": 2,
    "some_value_2": 1
  }
  ...
}
Note that the source data is effectively one large dictionary with lots of unique keys. It is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark:
key          | some_value_1 | some_value_2
-------------|--------------|-------------
unique_key_1 | 1            | 2
unique_key_2 | 2            | 3
unique_key_3 | 2            | 1
If I were dealing with small files, I could simply parse this using code similar to:
[{**{"key": k}, **v} for (k, v) in source_dict.items()]
Then, I would create a Spark DataFrame on this list and continue on with the rest of the operations I need to do.
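For reference, the small-file approach can be sketched end to end with the standard json module. The inline sample dict below is a stand-in for one real file's contents; only the final (commented-out) spark.createDataFrame call is Spark-specific:

```python
import json

# Small inline sample standing in for one source file; real data would
# come from json.load(open("source_dict.json")).
source_dict = json.loads("""
{
  "unique_key_1": {"some_value_1": 1, "some_value_2": 2},
  "unique_key_2": {"some_value_1": 2, "some_value_2": 3}
}
""")

# Flatten {key: {...}} into one flat record per unique key.
records = [{"key": k, **v} for k, v in source_dict.items()]
# records[0] == {'key': 'unique_key_1', 'some_value_1': 1, 'some_value_2': 2}

# With a SparkSession in scope, this list becomes the desired DataFrame:
# df = spark.createDataFrame(records)
```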
My problem is that I can't quite figure out how to parse a large JSON object like this into a DataFrame. When I use spark.read.json("source_dict.json"), I get a DataFrame with one row, where each of the unique key values is (predictably) read in as a column. Note that the real data files could have tens of thousands of these keys.
I'm fairly new to the Spark world, and I can't seem to find a way to accomplish this task. It seems like a pivot or something like that would help. Does anyone have any solutions or pointers to possible solutions? Thanks, I appreciate it!
Answer
Using flatMap, you can write a function to perform the transformation:
from pyspark.sql import Row

def f(row):
    # The input DataFrame has a single row whose columns are the unique
    # keys; each column value is a struct of (some_value_1, some_value_2).
    # Emit one output Row per key; flatMap then flattens the lists.
    rows = []
    d = row.asDict()
    for k in d.keys():
        rows.append(Row(k, d[k][0], d[k][1]))
    return rows

rdd = df.rdd.flatMap(f)
spark.createDataFrame(rdd).show()
+------------+---+---+
| _1| _2| _3|
+------------+---+---+
|unique_key_1| 1| 2|
|unique_key_2| 2| 3|
|unique_key_3| 2| 1|
+------------+---+---+
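The output above keeps Spark's default column names _1, _2, _3. One way (a sketch, not part of the original answer) to get named columns is to separate the per-key flattening into a plain Python helper and name the columns with toDF; the helper name explode_keys is hypothetical:

```python
def explode_keys(row_dict):
    """Turn {'unique_key_1': {'some_value_1': 1, 'some_value_2': 2}, ...}
    into one (key, some_value_1, some_value_2) tuple per unique key."""
    return [
        (k, v["some_value_1"], v["some_value_2"])
        for k, v in row_dict.items()
    ]

# Spark side (sketch): row.asDict() yields {key: Row(...)}, and Row
# supports lookup by field name, so the same helper applies directly:
# df2 = (df.rdd
#          .flatMap(lambda row: explode_keys(row.asDict()))
#          .toDF(["key", "some_value_1", "some_value_2"]))
```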
For additional info you can see this link