如何将多个数组组合为一个,然后展平并找到不同的项目 [英] How to group multiple arrays into one, then flatten and find distinct items

查看:19
本文介绍了如何将多个数组组合为一个,然后展平并找到不同的项目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

具有如下所示的数据框:

Having a dataframe like below:

val df = Seq(
  (1, Seq("USD", "CAD")),
  (2, Seq("AUD", "YEN", "USD")),
  (2, Seq("GBP", "AUD", "YEN")),
  (3, Seq("BRL", "AUS", "BND","BOB","BWP")),
  (3, Seq("XAF", "CLP", "BRL")),
  (3, Seq("XAF", "CNY", "KMF","CSK","EGP")
  )
).toDF("ACC", "CCY")

+---+-------------------------+
|ACC|CCY                      |
+---+-------------------------+
|1  |[USD, CAD]               |
|2  |[AUD, YEN, USD]          |
|2  |[GBP, AUD, YEN]          |
|3  |[BRL, AUS, BND, BOB, BWP]|
|3  |[XAF, CLP, BRL]          |
|3  |[XAF, CNY, KMF, CSK, EGP]|
+---+-------------------------+

这也必须通过删除重复项进行如下转换.

This has to be transformed as below by removing the duplicates too.

Spark 版本 = 2.0Scala 版本 = 2.10

+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD,CAD]                                              |
|2  |[AUD,YEN,USD,GBP]                                      |
|3  |[BRL,AUS,BND,BOB,BWP,XAF,CLP,CNY,KMF,CSK,EGP]          |
+---+-------------------------------------------------------+

我尝试按 ACC 列分组并聚合 CCY,但不确定从那里开始.

I tried grouping by ACC column and aggregating the CCY but not sure where to go from there.

这可以不使用 UDF 来完成吗?如果否,那么我将如何使用 UDF 解决这个问题?请指教.

Can this be done without using UDF? If NO, then how would I go about this using UDF? Please advice.

推荐答案

接下来的代码应该返回预期的结果:

The next code should return the expected results:

    import scala.collection.mutable.WrappedArray
    val df = Seq(
      (1, Seq("USD", "CAD")),
      (2, Seq("AUD", "YEN", "USD")),
      (2, Seq("GBP", "AUD", "YEN")),
      (3, Seq("BRL", "AUS", "BND", "BOB", "BWP")),
      (3, Seq("XAF", "CLP", "BRL")),
      (3, Seq("XAF", "CNY", "KMF", "CSK", "EGP")
      )
    ).toDF("ACC", "CCY")

    val castToArray = udf((ccy: WrappedArray[WrappedArray[String]]) => ccy.flatten.distinct.toArray)

    val df2 = df.groupBy($"ACC")
      .agg(collect_list($"CCY").as("CCY"))
      .withColumn("CCY", castToArray($"CCY"))
        .show(false)

首先我使用 groupBy("ACC") 然后使用聚合 collect_list 所有数组都集中为一个.接下来,CCY 的 udf 函数值被解包,结果被展平.

First I use groupBy("ACC") then with the aggregate collect_list all arrays are concentrated into one. Next, inside the udf function values of CCY are being unwrapped and the results are flattened.

输出:

+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD, CAD]                                             |
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2  |[AUD, YEN, USD, GBP]                                   |
+---+-------------------------------------------------------+

祝你好运

更新:

在 Spark >= 2.4 中,您可以使用内置的 flatten 和 array_distinct 函数并避免使用 udf:

In Spark >= 2.4 you can use the build-in flatten and array_distinct functions and avoid the usage of udf:

df.groupBy($"ACC")
          .agg(collect_list($"CCY").as("CCY"))
          .select($"ACC", array_distinct(flatten($"CCY")).as("CCY"))
          .show(false)

//Output
+---+-------------------------------------------------------+ 
|ACC|CCY                                                    | 
+---+-------------------------------------------------------+ 
|1  |[USD, CAD]                                             | 
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]| 
|2  |[AUD, YEN, USD, GBP]                                   | 
+---+-------------------------------------------------------+

这篇关于如何将多个数组组合为一个,然后展平并找到不同的项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆