Spark complex grouping

Question

I have this data structure in Spark:

// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._ // provides .toDF on a local Seq

val df = Seq(
  ("Package 1", Seq("address1", "address2", "address3")),
  ("Package 2", Seq("address3", "address4", "address5", "address6")),
  ("Package 3", Seq("address7", "address8")),
  ("Package 4", Seq("address9")),
  ("Package 5", Seq("address9", "address1")),
  ("Package 6", Seq("address10")),
  ("Package 7", Seq("address8"))
).toDF("Package", "Destinations")

// Show up to 20 rows without truncating long cells.
df.show(20, false)

I need to find all the addresses that were seen together across different packages, including addresses linked only indirectly through a shared address. I can't find an efficient way to do that; I've tried grouping, mapping, and so on. Ideally, the result for the given df would be:

+----+------------------------------------------------------------------------+
| Id |                               Addresses                                |
+----+------------------------------------------------------------------------+
|  1 | [address1, address2, address3, address4, address5, address6, address9] |
|  2 | [address7, address8]                                                   |
|  3 | [address10]                                                            |
+----+------------------------------------------------------------------------+

Recommended answer

Use TreeReduce

  • For the sequential operation, you build a Set of Sets:

    • For each new array of elements, e.g. [address7, address8], iterate through the existing sets and check whether the intersection is non-empty. If it is, add those elements to that set.

    • Otherwise, create a new set containing those elements.

  • For the combine operation:

    • For each set on the left side of the combine operation, iterate through all sets on the right side looking for a non-empty intersection. If one is found, merge the two sets.
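The two operations described above can be sketched in plain Scala, without a cluster. This is a minimal sketch under stated assumptions, not the answerer's code: the names `AddressGroups`, `Groups`, `seqOp`, `combOp`, `groupLocally` and `sample` are hypothetical, and partitions are simulated with local collections. It also makes one refinement to the literal description: a new array can overlap several existing sets at once (for example [address9, address1]), so the sketch merges all overlapping sets instead of adding to just one of them.

```scala
object AddressGroups {
  // A partial result: disjoint groups of addresses seen together.
  type Groups = Set[Set[String]]

  // Sequential step: fold one package's destinations into the groups so far.
  // Every existing group that shares an address with `dests` is merged with it;
  // groups with no overlap are kept unchanged.
  def seqOp(groups: Groups, dests: Seq[String]): Groups =
    if (dests.isEmpty) groups
    else {
      val incoming = dests.toSet
      val (overlapping, disjoint) =
        groups.partition(g => g.exists(incoming.contains))
      val merged = overlapping.foldLeft(incoming)(_ ++ _)
      disjoint + merged
    }

  // Combine step: merge two partial results by inserting each right-hand
  // group into the left-hand result using the same overlap rule.
  def combOp(left: Groups, right: Groups): Groups =
    right.foldLeft(left)((acc, g) => seqOp(acc, g.toSeq))

  // Local stand-in for a distributed aggregate (assumption: each inner Seq
  // plays the role of one partition of destination lists).
  def groupLocally(partitions: Seq[Seq[Seq[String]]]): Groups = {
    val empty: Groups = Set.empty
    partitions.map(_.foldLeft(empty)(seqOp)).foldLeft(empty)(combOp)
  }

  // The Destinations column from the question's DataFrame.
  val sample: Seq[Seq[String]] = Seq(
    Seq("address1", "address2", "address3"),
    Seq("address3", "address4", "address5", "address6"),
    Seq("address7", "address8"),
    Seq("address9"),
    Seq("address9", "address1"),
    Seq("address10"),
    Seq("address8")
  )

  def main(args: Array[String]): Unit = {
    // Split the sample into two simulated partitions of up to 4 packages each.
    val result = groupLocally(sample.grouped(4).toSeq)
    result.foreach(g => println(g.toSeq.sorted.mkString("[", ", ", "]")))
  }
}
```

In Spark, these two functions have the shapes expected by `RDD.treeAggregate(zeroValue)(seqOp, combOp)`, with an empty `Set[Set[String]]` as the zero value. The aggregated result is returned to the driver, so this approach suits cases where the full set of groups fits in driver memory.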

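    Note: RDD exposes both treeReduce and treeAggregate. treeReduce takes a single merge function, while treeAggregate takes a zero value plus separate sequential and combine functions, so the two-step description above maps most directly onto treeAggregate.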
