Pyspark create dictionary within groupby


Question

Is it possible in pyspark to create dictionary within groupBy.agg()? Here is a toy example:

import pyspark
from pyspark.sql import Row
import pyspark.sql.functions as F

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

toy_data = spark.createDataFrame([
    Row(id=1, key='a', value="123"),
    Row(id=1, key='b', value="234"),
    Row(id=1, key='c', value="345"),
    Row(id=2, key='a', value="12"),
    Row(id=2, key='x', value="23"),
    Row(id=2, key='y', value="123")])

toy_data.show()

+---+---+-----+
| id|key|value|
+---+---+-----+
|  1|  a|  123|
|  1|  b|  234|
|  1|  c|  345|
|  2|  a|   12|
|  2|  x|   23|
|  2|  y|  123|
+---+---+-----+

This is the expected output:

---+------------------------------------
id |  key_value
---+------------------------------------
1  | {"a": "123", "b": "234", "c": "345"}
2  | {"a": "12", "x": "23", "y": "123"}
---+------------------------------------


I tried this, but it doesn't work:

toy_data.groupBy("id").agg(
    F.create_map(col("key"),col("value")).alias("key_value")
)

This produces the following error:

AnalysisException: u"expression '`key`' is neither present in the group by, nor is it an aggregate function....

Answer

The agg component has to contain an actual aggregation function. One way to approach this is to combine collect_list

Aggregate function: returns a list of objects with duplicates.

struct:

Creates a new struct column.

and map_from_entries:

Collection function: Returns a map created from the given array of entries.
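
As a small aside not in the original answer, this sketch (assuming Spark 2.4+, where map_from_entries is available) shows what the function does on a hand-built array of entries:

spark.range(1).select(
    F.map_from_entries(
        F.array(
            F.struct(F.lit("a").alias("key"), F.lit("1").alias("value")),
            F.struct(F.lit("b").alias("key"), F.lit("2").alias("value"))
        )
    ).alias("m")
).show(truncate=False)

The single output row contains the map with entries a -> 1 and b -> 2.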

This is how you'd do that:

toy_data.groupBy("id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("key", "value"))).alias("key_value")
).show(truncate=False)

+---+------------------------------+
|id |key_value                     |
+---+------------------------------+
|1  |[a -> 123, b -> 234, c -> 345]|
|2  |[a -> 12, x -> 23, y -> 123]  |
+---+------------------------------+
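
Not part of the original answer, but as a usage note: when a MapType column like key_value is collected to the driver, each value comes back as a regular Python dict, so turning the result into a dictionary of dictionaries is straightforward. A minimal sketch (the names result and key_value_by_id are just illustrative):

result = toy_data.groupBy("id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("key", "value"))).alias("key_value")
)

# Each collected MapType value is a plain Python dict.
key_value_by_id = {row["id"]: row["key_value"] for row in result.collect()}
# {1: {'a': '123', 'b': '234', 'c': '345'}, 2: {'a': '12', 'x': '23', 'y': '123'}}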
