Parsing JSON object with large number of unique keys (not a list of objects) using PySpark

Question
I'm currently dealing with the following source data in a JSON file:
{
  "unique_key_1": {
    "some_value_1": 1,
    "some_value_2": 2
  },
  "unique_key_2": {
    "some_value_1": 2,
    "some_value_2": 3
  },
  "unique_key_3": {
    "some_value_1": 2,
    "some_value_2": 1
  }
  ...
}
Note that the source data is effectively one large dictionary with lots of unique keys. It is NOT a list of dictionaries. I have lots of large JSON files like this that I want to parse into the following DataFrame structure using PySpark:
key          | some_value_1 | some_value_2
-------------|--------------|-------------
unique_key_1 | 1            | 2
unique_key_2 | 2            | 3
unique_key_3 | 2            | 1
If I were dealing with small files, I could simply parse this using code similar to:
[{**{"key": k}, **v} for (k, v) in source_dict.items()]
Then, I would create a Spark DataFrame on this list and continue on with the rest of the operations I need to do.
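For reference, the small-file approach can be sketched end to end with the standard json module. The inline sample dict below is a stand-in for one real file's contents; only the final (commented-out) spark.createDataFrame call is Spark-specific:

```python
import json

# Small inline sample standing in for one source file; real data would
# come from json.load(open("source_dict.json")).
source_dict = json.loads("""
{
  "unique_key_1": {"some_value_1": 1, "some_value_2": 2},
  "unique_key_2": {"some_value_1": 2, "some_value_2": 3}
}
""")

# Flatten {key: {...}} into one flat record per unique key.
records = [{"key": k, **v} for k, v in source_dict.items()]
# records[0] == {'key': 'unique_key_1', 'some_value_1': 1, 'some_value_2': 2}

# With a SparkSession in scope, this list becomes the desired DataFrame:
# df = spark.createDataFrame(records)
```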
My problem is that I can't quite figure out how to parse a large JSON object like this into a DataFrame. When I use spark.read.json("source_dict.json"), I get a DataFrame with one row, where each of the unique key values is (predictably) read in as a column. Note that the real data files could have tens of thousands of these keys.
I'm fairly new to the Spark world, and I can't seem to find a way to accomplish this task. It seems like a pivot or something like that would help. Does anyone have any solutions or pointers to possible solutions? Thanks, I appreciate it!
Answer
Using flatMap, you can write a function to perform the transformation:
from pyspark.sql import Row

def f(row):
    # The input DataFrame has a single row whose columns are the unique
    # keys; each column value is a struct of (some_value_1, some_value_2).
    # Emit one output Row per key; flatMap then flattens the lists.
    rows = []
    d = row.asDict()
    for k in d.keys():
        rows.append(Row(k, d[k][0], d[k][1]))
    return rows

rdd = df.rdd.flatMap(f)
spark.createDataFrame(rdd).show()
+------------+---+---+
| _1| _2| _3|
+------------+---+---+
|unique_key_1| 1| 2|
|unique_key_2| 2| 3|
|unique_key_3| 2| 1|
+------------+---+---+
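The output above keeps Spark's default column names _1, _2, _3. One way (a sketch, not part of the original answer) to get named columns is to separate the per-key flattening into a plain Python helper and name the columns with toDF; the helper name explode_keys is hypothetical:

```python
def explode_keys(row_dict):
    """Turn {'unique_key_1': {'some_value_1': 1, 'some_value_2': 2}, ...}
    into one (key, some_value_1, some_value_2) tuple per unique key."""
    return [
        (k, v["some_value_1"], v["some_value_2"])
        for k, v in row_dict.items()
    ]

# Spark side (sketch): row.asDict() yields {key: Row(...)}, and Row
# supports lookup by field name, so the same helper applies directly:
# df2 = (df.rdd
#          .flatMap(lambda row: explode_keys(row.asDict()))
#          .toDF(["key", "some_value_1", "some_value_2"]))
```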
For additional info you can see this link