如何将多个数组分组为一个,然后展平并找到不同的项目 [英] How to group multiple arrays into one, then flatten and find distinct items

查看:119
本文介绍了如何将多个数组分组为一个,然后展平并找到不同的项目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

具有如下数据框:

val df = Seq(
  (1, Seq("USD", "CAD")),
  (2, Seq("AUD", "YEN", "USD")),
  (2, Seq("GBP", "AUD", "YEN")),
  (3, Seq("BRL", "AUS", "BND","BOB","BWP")),
  (3, Seq("XAF", "CLP", "BRL")),
  (3, Seq("XAF", "CNY", "KMF","CSK","EGP")
  )
).toDF("ACC", "CCY")

+---+-------------------------+
|ACC|CCY                      |
+---+-------------------------+
|1  |[USD, CAD]               |
|2  |[AUD, YEN, USD]          |
|2  |[GBP, AUD, YEN]          |
|3  |[BRL, AUS, BND, BOB, BWP]|
|3  |[XAF, CLP, BRL]          |
|3  |[XAF, CNY, KMF, CSK, EGP]|
+---+-------------------------+

必须通过删除重复项来进行如下转换.

This has to be transformed as below by removing the duplicates too.

火花版本= 2.0 Scala版本= 2.10

Spark Version = 2.0 Scala Version = 2.10

+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD,CAD]                                              |
|2  |[AUD,YEN,USD,GBP]                                      |
|3  |[BRL,AUS,BND,BOB,BWP,XAF,CLP,CNY,KMF,CSK,EGP]          |
+---+-------------------------------------------------------+

我尝试按ACC列分组并汇总CCY,但不确定从那里开始.

I tried grouping by ACC column and aggregating the CCY but not sure where to go from there.

可以在不使用UDF的情况下完成此操作吗?如果否,那么我将如何使用UDF进行处理? 请指教.

Can this be done without using UDF? If NO, then how would I go about this using UDF? Please advice.

推荐答案

下一个代码应返回预期结果:

The next code should return the expected results:

    import scala.collection.mutable.WrappedArray
    val df = Seq(
      (1, Seq("USD", "CAD")),
      (2, Seq("AUD", "YEN", "USD")),
      (2, Seq("GBP", "AUD", "YEN")),
      (3, Seq("BRL", "AUS", "BND", "BOB", "BWP")),
      (3, Seq("XAF", "CLP", "BRL")),
      (3, Seq("XAF", "CNY", "KMF", "CSK", "EGP")
      )
    ).toDF("ACC", "CCY")

    val castToArray = udf((ccy: WrappedArray[WrappedArray[String]]) => ccy.flatten.distinct.toArray)

    val df2 = df.groupBy($"ACC")
      .agg(collect_list($"CCY").as("CCY"))
      .withColumn("CCY", castToArray($"CCY"))
        .show(false)

首先,我使用groupBy("ACC"),然后使用汇总collect_list将所有数组集中到一个数组中.接下来,将CCY的udf函数值展开,并将结果展平.

First I use groupBy("ACC") then with the aggregate collect_list all arrays are concentrated into one. Next, inside the udf function values of CCY are being unwrapped and the results are flattened.

输出:

+---+-------------------------------------------------------+
|ACC|CCY                                                    |
+---+-------------------------------------------------------+
|1  |[USD, CAD]                                             |
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]|
|2  |[AUD, YEN, USD, GBP]                                   |
+---+-------------------------------------------------------+

祝你好运

更新:

在Spark> = 2.4中,您可以使用内置的flatten和array_distinct函数,并避免使用udf:

In Spark >= 2.4 you can use the build-in flatten and array_distinct functions and avoid the usage of udf:

df.groupBy($"ACC")
          .agg(collect_list($"CCY").as("CCY"))
          .select($"ACC", array_distinct(flatten($"CCY")).as("CCY"))
          .show(false)

//Output
+---+-------------------------------------------------------+ 
|ACC|CCY                                                    | 
+---+-------------------------------------------------------+ 
|1  |[USD, CAD]                                             | 
|3  |[BRL, AUS, BND, BOB, BWP, XAF, CLP, CNY, KMF, CSK, EGP]| 
|2  |[AUD, YEN, USD, GBP]                                   | 
+---+-------------------------------------------------------+

这篇关于如何将多个数组分组为一个,然后展平并找到不同的项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆