PySpark - RDD to JSON

Views: 840 | Published: 2019/11/24 21:24:29 | Tags: arrays, json, pyspark

Question

I have a Hive query that returns data in this format:

ip, category, score
1.2.3.4, X, 5
10.10.10.10, A, 2
1.2.3.4, Y, 2
12.12.12.12, G, 10
1.2.3.4, Z, 9
10.10.10.10, X, 3

In PySpark, I get this via hive_context.sql(my_query).rdd

Each ip address can have multiple scores (hence multiple rows). I would like to get this data in a json/array format as follows:

[
    {
        "ip": "1.2.3.4",
        "scores": [
            {
                "category": "X",
                "score": 5
            },
            {
                "category": "Y",
                "score": 2
            },
            {
                "category": "Z",
                "score": 9
            }
        ]
    },
    {
        "ip": "10.10.10.10",
        "scores": [
            {
                "category": "A",
                "score": 2
            },
            {
                "category": "X",
                "score": 3
            }
        ]
    },
    {
        "ip": "12.12.12.12",
        "scores": [
            {
                "category": "G",
                "score": 10
            }
        ]
    }
]

Note that the RDD isn't necessarily sorted and the RDD can easily contain a couple of hundred million rows. I'm new to PySpark so any pointers on how to go about this efficiently would help.

Recommended answer

Group the rows by ip with groupBy, then map each group to the structure you need:

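# Group rows by ip, then build one dict per ip holding its list of scores.
rdd.groupBy(lambda r: r.ip).map(
  lambda g: {
    'ip': g[0],
    'scores': [{'category': x['category'], 'score': x['score']} for x in g[1]]}
).collect()  # collect() pulls every result to the driver; use it only on small outputs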

# [{'ip': '1.2.3.4', 'scores': [{'category': 'X', 'score': 5}, {'category': 'Y', 'score': 2}, {'category': 'Z', 'score': 9}]}, {'ip': '12.12.12.12', 'scores': [{'category': 'G', 'score': 10}]}, {'ip': '10.10.10.10', 'scores': [{'category': 'A', 'score': 2}, {'category': 'X', 'score': 3}]}]
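
The result above is a list of Python dicts rather than JSON text. As a minimal sketch of the remaining step, the plain-Python example below reproduces the grouping on the sample data from the question and serializes each group with json.dumps. The helper names group_to_record and record_to_json are hypothetical (they are not part of PySpark), and the local defaultdict loop only stands in for rdd.groupBy so the example runs without a Spark cluster.

```python
import json
from collections import defaultdict


def group_to_record(ip, rows):
    """Build one output record from an ip and an iterable of its rows.

    Each row is assumed to support row['category'] and row['score'],
    which holds for plain dicts and for pyspark.sql.Row objects.
    """
    return {
        "ip": ip,
        "scores": [{"category": r["category"], "score": r["score"]} for r in rows],
    }


def record_to_json(record):
    """Serialize one record to a single-line JSON string."""
    return json.dumps(record, sort_keys=True)


# Local stand-in for the rows returned by the Hive query (sample data from the question).
sample_rows = [
    {"ip": "1.2.3.4", "category": "X", "score": 5},
    {"ip": "10.10.10.10", "category": "A", "score": 2},
    {"ip": "1.2.3.4", "category": "Y", "score": 2},
    {"ip": "12.12.12.12", "category": "G", "score": 10},
    {"ip": "1.2.3.4", "category": "Z", "score": 9},
    {"ip": "10.10.10.10", "category": "X", "score": 3},
]

# Plain-Python equivalent of rdd.groupBy(lambda r: r.ip): collect rows per ip.
groups = defaultdict(list)
for row in sample_rows:
    groups[row["ip"]].append(row)

# One JSON string per ip, in first-seen order of the ips.
json_lines = [record_to_json(group_to_record(ip, rows)) for ip, rows in groups.items()]

for line in json_lines:
    print(line)
```

In Spark, the same helpers could presumably be applied as rdd.groupBy(lambda r: r.ip).map(lambda g: record_to_json(group_to_record(g[0], g[1]))). For an RDD with hundreds of millions of rows, writing that result out with saveAsTextFile instead of calling collect() would keep the data distributed rather than loading it all onto the driver.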
