GroupBy and concat array columns pyspark
Problem Description
I have this dataframe:
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])
+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+
and I'd like to transform it into the following df:
+-----+------------------+
|store|            values|
+-----+------------------+
|    1|[1, 2, 3, 4, 5, 6]|
|    2|            [2, 3]|
+-----+------------------+
I did it this way:
from pyspark.sql import functions as F
df.groupBy("store").agg(F.collect_list("values"))
but the solution has these WrappedArrays:
+-----+----------------------------------------------+
|store|collect_list(values)                          |
+-----+----------------------------------------------+
|1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
|2    |[WrappedArray(2), WrappedArray(3)]            |
+-----+----------------------------------------------+
Is there any way to transform the WrappedArrays into concatenated arrays? Or can I do it differently?

Thanks!
Recommended Answer
You need a flattening UDF; starting from your own df:
spark.version
# u'2.2.0'
from pyspark.sql import functions as F
import pyspark.sql.types as T
# concatenate a list of lists into a single flat list
def fudf(val):
    return reduce(lambda x, y: x + y, val)
flattenUdf = F.udf(fudf, T.ArrayType(T.IntegerType()))
df2 = df.groupBy("store").agg(F.collect_list("values"))
df2.show(truncate=False)
# +-----+----------------------------------------------+
# |store|collect_list(values)                          |
# +-----+----------------------------------------------+
# |1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
# |2    |[WrappedArray(2), WrappedArray(3)]            |
# +-----+----------------------------------------------+
df3 = df2.select("store", flattenUdf("collect_list(values)").alias("values"))
df3.show(truncate=False)
# +-----+------------------+
# |store|values            |
# +-----+------------------+
# |1    |[1, 2, 3, 4, 5, 6]|
# |2    |[2, 3]            |
# +-----+------------------+
UPDATE (after the comments):

The above snippet will work only with Python 2. With Python 3, you should modify the UDF as follows:
import functools
def fudf(val):
    # in Python 3, reduce lives in functools
    return functools.reduce(lambda x, y: x + y, val)
Tested with Spark 2.4.4.
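As a side note: on Spark 2.4+ the UDF is no longer necessary, because the built-in flatten function can concatenate the collected arrays directly. A minimal sketch of this alternative (not part of the original answer, using only standard pyspark.sql.functions):

from pyspark.sql import functions as F

# flatten() (Spark >= 2.4) concatenates an array of arrays natively,
# avoiding the serialization overhead of a Python UDF
df3 = df.groupBy("store").agg(F.flatten(F.collect_list("values")).alias("values"))
df3.show(truncate=False)
# +-----+------------------+
# |store|values            |
# +-----+------------------+
# |1    |[1, 2, 3, 4, 5, 6]|
# |2    |[2, 3]            |
# +-----+------------------+

Note that collect_list does not guarantee element order, so the order of the concatenated values may vary across runs (this applies to the UDF approach as well).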