pyspark collect_set or collect_list with groupby
Question

How can I use collect_set or collect_list on a dataframe after groupby? For example, df.groupby('key').collect_set('values') raises an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'
Answer
You need to use agg. Example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
Note that in the above you have to create a HiveContext. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.
(df
.groupby("id")
.agg(F.collect_set("code"),
F.collect_list("name"))
.show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+