如何按数组中的公共元素分组? [英] How to group by common element in array?
问题描述
我试图在 spark 中找到解决方案,将数据与数组中的公共元素分组.
I am trying to find the solution in spark to group data with a common element in an array.
key value
[k1,k2] v1
[k2] v2
[k3,k2] v3
[k4] v4
如果key中的任何元素匹配,我们必须为其分配相同的groupid.(Groupby common element)
If any element matches in key, we have to assign same groupid to that.(Groupby common element)
结果:
key value GroupID
[k1,k2] v1 G1
[k2] v2 G1
[k3,k2] v3 G1
[k4] v4 G2
Spark Graphx 已经给出了一些建议,但目前学习曲线将更多地用于单个功能的实现.
Some suggestions are already given with Spark Graphx, but at this moment learning curve will be more to implement this for a single feature.
推荐答案
Include graphframes
(最新支持的 Spark 版本是 2.1,但它也应该支持 2.2,如果您使用更新的,则必须使用 2.3 补丁构建自己的)将 XXX
替换为 Spark 版本和 YYY
Scala 版本:
Include graphframes
(the latest supported Spark version is 2.1, but it should support 2.2 as well, if you use newer you'll have to build your own with 2.3 patch) replacing XXX
with Spark version and YYY
with Scala version:
spark.jars.packages graphframes:graphframes:0.5.0-sparkXXX-s_YYY
添加爆炸键:
import org.apache.spark.sql.functions._
val df = Seq(
(Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
(Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
).toDF("key", "value")
val edges = df.select(
explode($"key") as "src", $"value" as "dst")
转换为graphframe
:
import org.graphframes._
val gf = GraphFrame.fromEdges(edges)
设置检查点目录(如果没有设置):
Set checkpoint directory (if not set):
import org.apache.spark.sql.SparkSession
val path: String = ???
val spark: SparkSession = ???
spark.sparkContext.setCheckpointDir(path)
查找连接的组件:
val components = GraphFrame.fromEdges(edges).connectedComponents.setAlgorithm("graphx").run
用输入数据连接结果:
val result = components.where($"id".startsWith("v")).toDF("value", "group").join(df, Seq("value"))
检查结果:
result.show
// +-----+------------+--------+
// |value| group| key|
// +-----+------------+--------+
// | v3|489626271744|[k3, k2]|
// | v2|489626271744| [k2]|
// | v4|532575944704| [k4]|
// | v1|489626271744|[k1, k2]|
// +-----+------------+--------+
这篇关于如何按数组中的公共元素分组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!