pyspark - create DataFrame Grouping columns in map type structure
Problem description
My DataFrame has the following structure:
-------------------------
| Brand | type | amount|
-------------------------
| B | a | 10 |
| B | b | 20 |
| C | c | 30 |
-------------------------
I want to reduce the number of rows by grouping Type and Amount into one single column of type Map, so that Brand will be unique and MAP_type_AMOUNT will hold a key/value pair for each Type/Amount combination.
I think Spark SQL might have some functions to help in this process, or do I have to get the RDD backing the DataFrame and make my "own" conversion to map type?
Expected:
---------------------------
| Brand | MAP_type_AMOUNT |
---------------------------
| B     | {a: 10, b: 20}  |
| C     | {c: 30}         |
---------------------------
Recommended answer
Slight improvement to Prem's answer (sorry I can't comment yet):
Use func.create_map instead of func.struct. See the documentation.
import pyspark.sql.functions as func

df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20),
                     ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# Build a single-entry map {Type: Amount} per row, then collect them per Brand
df_converted = df.groupBy("Brand").\
    agg(func.collect_list(func.create_map(func.col("Type"),
        func.col("Amount"))).alias("MAP_type_AMOUNT"))
print(df_converted.collect())

Output:

[Row(Brand='B', MAP_type_AMOUNT=[{'a': 10}, {'b': 20}]),
 Row(Brand='C', MAP_type_AMOUNT=[{'c': 30}])]