如何按数组中的公共元素分组? [英] How to group by common element in array?
问题描述
我试图在spark中找到解决方案,以将数据与数组中的公共元素分组.
I am trying to find the solution in spark to group data with a common element in an array.
key value
[k1,k2] v1
[k2] v2
[k3,k2] v3
[k4] v4
如果键中有任何元素匹配,我们必须为其分配相同的groupid.(Groupby通用元素)
If any element matches in key, we have to assign same groupid to that.(Groupby common element)
结果:
key value GroupID
[k1,k2] v1 G1
[k2] v2 G1
[k3,k2] v3 G1
[k4] v4 G2
Spark Graphx已经给出了一些建议,但是目前学习曲线将更多地用于单个功能.
Some suggestions are already given with Spark Graphx, but at this moment learning curve will be more to implement this for a single feature.
推荐答案
包括 graphframes
(最新支持的Spark版本是2.1,但它也应该支持2.2,如果使用较新的版本,则必须使用2.3补丁程序构建自己的版本),将XXX
替换为Spark版本,将YYY
替换为Scala版本:
Include graphframes
(the latest supported Spark version is 2.1, but it should support 2.2 as well, if you use newer you'll have to build your own with 2.3 patch) replacing XXX
with Spark version and YYY
with Scala version:
spark.jars.packages graphframes:graphframes:0.5.0-sparkXXX-s_YYY
添加爆炸键:
import org.apache.spark.sql.functions._
val df = Seq(
(Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
(Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
).toDF("key", "value")
val edges = df.select(
explode($"key") as "src", $"value" as "dst")
转换为graphframe
:
import org.graphframes._
val gf = GraphFrame.fromEdges(edges)
设置检查点目录(如果未设置):
Set checkpoint directory (if not set):
import org.apache.spark.sql.SparkSession
val path: String = ???
val spark: SparkSession = ???
spark.sparkContext.setCheckpointDir(path)
查找连接的组件:
val components = GraphFrame.fromEdges(edges).connectedComponents.setAlgorithm("graphx").run
将结果与输入数据结合起来
Join result with input data:
val result = components.where($"id".startsWith("v")).toDF("value", "group").join(df, Seq("value"))
检查结果:
result.show
// +-----+------------+--------+
// |value| group| key|
// +-----+------------+--------+
// | v3|489626271744|[k3, k2]|
// | v2|489626271744| [k2]|
// | v4|532575944704| [k4]|
// | v1|489626271744|[k1, k2]|
// +-----+------------+--------+
这篇关于如何按数组中的公共元素分组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!