如何按数组中的公共元素分组? [英] How to group by common element in array?

查看:20
本文介绍了如何按数组中的公共元素分组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在 spark 中找到解决方案,将数据与数组中的公共元素分组.

I am trying to find the solution in spark to group data with a common element in an array.

 key                            value
[k1,k2]                         v1
[k2]                            v2
[k3,k2]                         v3
[k4]                            v4

如果key中的任何元素匹配,我们必须为其分配相同的groupid.(Groupby common element)

If any element matches in key, we have to assign same groupid to that.(Groupby common element)

结果:

key                             value  GroupID
[k1,k2]                           v1    G1
[k2]                              v2    G1
[k3,k2]                           v3    G1 
[k4]                              v4    G2

Spark Graphx 已经给出了一些建议,但目前学习曲线将更多地用于单个功能的实现.

Some suggestions are already given with Spark Graphx, but at this moment learning curve will be more to implement this for a single feature.

推荐答案

Include graphframes(最新支持的 Spark 版本是 2.1,但它也应该支持 2.2,如果您使用更新的,则必须使用 2.3 补丁构建自己的)将 XXX 替换为 Spark 版本和 YYY Scala 版本:

Include graphframes (the latest supported Spark version is 2.1, but it should support 2.2 as well, if you use newer you'll have to build your own with 2.3 patch) replacing XXX with Spark version and YYY with Scala version:

spark.jars.packages  graphframes:graphframes:0.5.0-sparkXXX-s_YYY

添加爆炸键:

import org.apache.spark.sql.functions._

val df = Seq(
   (Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
   (Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
).toDF("key", "value")

val edges = df.select(
  explode($"key") as "src", $"value" as "dst")

转换为graphframe:

import org.graphframes._

val gf = GraphFrame.fromEdges(edges)

设置检查点目录(如果没有设置):

Set checkpoint directory (if not set):

import org.apache.spark.sql.SparkSession

val path: String = ???
val spark: SparkSession = ???
spark.sparkContext.setCheckpointDir(path)

查找连接的组件:

val components = GraphFrame.fromEdges(edges).connectedComponents.setAlgorithm("graphx").run

用输入数据连接结果:

 val result = components.where($"id".startsWith("v")).toDF("value", "group").join(df, Seq("value"))

检查结果:

result.show

// +-----+------------+--------+
// |value|       group|     key|
// +-----+------------+--------+
// |   v3|489626271744|[k3, k2]|
// |   v2|489626271744|    [k2]|
// |   v4|532575944704|    [k4]|
// |   v1|489626271744|[k1, k2]|
// +-----+------------+--------+

这篇关于如何按数组中的公共元素分组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆