如何按数组中的公共元素分组? [英] How to group by common element in array?

查看:110
本文介绍了如何按数组中的公共元素分组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在spark中找到解决方案,以将数据与数组中的公共元素分组.

I am trying to find the solution in spark to group data with a common element in an array.

 key                            value
[k1,k2]                         v1
[k2]                            v2
[k3,k2]                         v3
[k4]                            v4

如果键中有任何元素匹配,我们必须为其分配相同的groupid.(Groupby通用元素)

If any element matches in key, we have to assign same groupid to that.(Groupby common element)

结果:

key                             value  GroupID
[k1,k2]                           v1    G1
[k2]                              v2    G1
[k3,k2]                           v3    G1 
[k4]                              v4    G2

Spark Graphx已经给出了一些建议,但是目前学习曲线将更多地用于单个功能.

Some suggestions are already given with Spark Graphx, but at this moment learning curve will be more to implement this for a single feature.

推荐答案

包括 graphframes (最新支持的Spark版本是2.1,但它也应该支持2.2,如果使用较新的版本,则必须使用2.3补丁程序构建自己的版本),将XXX替换为Spark版本,将YYY替换为Scala版本:

Include graphframes (the latest supported Spark version is 2.1, but it should support 2.2 as well, if you use newer you'll have to build your own with 2.3 patch) replacing XXX with Spark version and YYY with Scala version:

spark.jars.packages  graphframes:graphframes:0.5.0-sparkXXX-s_YYY

添加爆炸键:

import org.apache.spark.sql.functions._

val df = Seq(
   (Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
   (Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
).toDF("key", "value")

val edges = df.select(
  explode($"key") as "src", $"value" as "dst")

转换为graphframe:

import org.graphframes._

val gf = GraphFrame.fromEdges(edges)

设置检查点目录(如果未设置):

Set checkpoint directory (if not set):

import org.apache.spark.sql.SparkSession

val path: String = ???
val spark: SparkSession = ???
spark.sparkContext.setCheckpointDir(path)

查找连接的组件:

val components = GraphFrame.fromEdges(edges).connectedComponents.setAlgorithm("graphx").run

将结果与输入数据结合起来

Join result with input data:

 val result = components.where($"id".startsWith("v")).toDF("value", "group").join(df, Seq("value"))

检查结果:

result.show

// +-----+------------+--------+
// |value|       group|     key|
// +-----+------------+--------+
// |   v3|489626271744|[k3, k2]|
// |   v2|489626271744|    [k2]|
// |   v4|532575944704|    [k4]|
// |   v1|489626271744|[k1, k2]|
// +-----+------------+--------+

这篇关于如何按数组中的公共元素分组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆