Grouping pyspark dataframe by intersection

Problem description

I need to group a PySpark dataframe by the intersection of the arrays in a column. For example, starting from a dataframe like this:

v1 | [1, 2, 3]
v2 | [4, 5]
v3 | [1, 7]

The result should be:

[v1, v3] | [1, 2, 3, 7]
[v2] | [4, 5]

Because rows 1 and 3 have the value 1 in common.

Is there a group-by-like method that groups rows on intersection?

Thank you in advance for ideas and suggestions on how to solve this.

Recommended answer

from pyspark.sql import functions as F

df = spark.createDataFrame([["v1", [1, 2, 3]], ["v2", [4, 5]], ["v3", [1, 7]]], ["id", "arr"])

# For each array element, collect the set of ids whose arrays contain it.
df1 = df.select("*", F.explode("arr").alias("explode_arr")) \
    .groupBy("explode_arr").agg(F.collect_set("id").alias("ids"))

# Join the exploded rows back to those id sets, then merge the original
# arrays of each id set into one de-duplicated array.
df2 = df.select("*", F.explode("arr").alias("explode_arr")) \
    .join(df1, ["explode_arr"], "inner") \
    .groupBy("ids").agg(F.collect_set("arr").alias("array_set")) \
    .select("ids", F.array_distinct(F.flatten("array_set")).alias("intersection_arrays"))

# Ids that already belong to a multi-id group, wrapped in single-element
# arrays so they can be matched against the "ids" column.
df3 = df2.where(F.size("ids") > 1) \
    .select(F.explode("ids").alias("ids")) \
    .select(F.array("ids").alias("ids"))

# Drop the singleton groups that are subsumed by a larger group.
df4 = df2.join(df3.withColumn("flag", F.lit(1)), ["ids"], "left_outer") \
    .where(F.col("flag").isNull()).drop("flag")

df4.show()

+--------+-------------------+
|     ids|intersection_arrays|
+--------+-------------------+
|    [v2]|             [4, 5]|
|[v3, v1]|       [1, 7, 2, 3]|
+--------+-------------------+ 
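Two notes on this output. First, collect_set does not guarantee element order, which is why the merged array appears as [1, 7, 2, 3]; wrap the column in F.array_sort(...) if a deterministic order is needed. Second, the df3/df4 step is effectively an anti-join: it keeps only the groups that are not subsumed by a larger one. As a minimal sketch (my own rewrite, not part of the original answer, reusing df2 from above), the same filtering can be expressed with a left anti join and no helper flag column:

# Assumption: df2 is the grouped dataframe built above. Ids that already
# belong to a multi-id group are wrapped as single-element arrays and then
# removed from df2 with a left anti join, replacing the flag/null check.
singletons = df2.where(F.size("ids") > 1) \
    .select(F.explode("ids").alias("id")) \
    .select(F.array("id").alias("ids"))
df4_anti = df2.join(singletons, ["ids"], "left_anti")
df4_anti.show()

This produces the same two rows as df4.show() above; left_anti simply states the "keep rows with no match" intent directly.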
