Parsing JSON object with large number of unique keys (not a list of objects) using PySpark

Question

I'm currently dealing with the following source data in a JSON file:

{
    "unique_key_1": {
        "some_value_1": 1,
        "some_value_2": 2
    },
    "unique_key_2": {
        "some_value_1": 2,
        "some_value_2": 3
    },
    "unique_key_3": {
        "some_value_1": 2,
        "some_value_2": 1
    },
    ...
}

Note that the source data is effectively one large dictionary with many unique keys. It is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark:

key          | some_value_1 | some_value_2
-------------------------------------------
unique_key_1 |            1 |            2
unique_key_2 |            2 |            3
unique_key_3 |            2 |            1

If I were dealing with small files, I could simply parse this using code similar to:

[{**{"key": k}, **v} for (k, v) in source_dict.items()] 

Then I would create a Spark DataFrame from this list and continue with the rest of the operations I need to do.
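
For reference, a minimal sketch of that small-file approach, assuming an active SparkSession named spark and the sample object above saved as source_dict.json:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the whole JSON object into a plain Python dict
with open("source_dict.json") as f:
    source_dict = json.load(f)

# Flatten {key: {...}} into a list of row dicts and build the DataFrame
rows = [{"key": k, **v} for k, v in source_dict.items()]
df = spark.createDataFrame(rows)
df.show()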

My problem is that I can't quite figure out how to parse a large JSON object like this into a DataFrame. When I use spark.read.json("source_dict.json"), I get a DataFrame with one row, where (predictably) each unique key is read in as a column. Note that the real data files can have tens of thousands of these keys.
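
As an aside, if the file is pretty-printed as shown above, Spark needs the multiLine option to read the whole object as a single record; a sketch of what that one-row read looks like:

df = spark.read.json("source_dict.json", multiLine=True)

df.count()  # 1 -- the entire top-level object becomes a single row
df.printSchema()
# root
#  |-- unique_key_1: struct (nullable = true)
#  |    |-- some_value_1: long (nullable = true)
#  |    |-- some_value_2: long (nullable = true)
#  |-- unique_key_2: struct (nullable = true)
#  ...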

I'm fairly new to the Spark world, and I can't seem to find a way to accomplish this task. It seems like a pivot or something like that would help. Does anyone have any solutions or pointers to possible solutions? Thanks, I appreciate it!

Answer

Using flatMap, you can write a function to perform the transformation:

from pyspark.sql import Row

# df is the single-row DataFrame read from the source file,
# e.g. df = spark.read.json("source_dict.json", multiLine=True)

def f(row):
    # The one input row has a struct column per unique key; emit
    # one output Row of (key, some_value_1, some_value_2) per key.
    l = []
    d = row.asDict()
    for k in d.keys():
        l.append(Row(k, d[k][0], d[k][1]))
    return l

rdd = df.rdd.flatMap(f)
spark.createDataFrame(rdd).show()


+------------+---+---+
|          _1| _2| _3|
+------------+---+---+
|unique_key_1|  1|  2|
|unique_key_2|  2|  3|
|unique_key_3|  2|  1|
+------------+---+---+
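
The generated column names are positional (_1, _2, _3); as a small follow-up sketch (not part of the original answer), toDF can restore the names from the question:

spark.createDataFrame(rdd).toDF("key", "some_value_1", "some_value_2").show()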

For additional info you can see this link.
