SPARK - Joining 2 dataframes on values in an array


Problem description

I can't find an easy and elegant solution to this one.

I have a df1 with this column:

 |-- guitars: array (nullable = true)
 |    |-- element: long (containsNull = true)

I have a df2 made of guitars, with an id matching the Long values in my df1.

root
 |-- guitarId: long (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- type: string (nullable = true)

I want to join my two dfs, obviously, and instead of having an array of Longs, I want an array of guitar structs from df2.

I'm using array_contains() to join the two dfs, but Spark explodes the array of n Longs in df1 into n rows in the result df.

Before

|   2|Eric Clapton| [1, 5]|               [,,,]|

After

|   2|Eric Clapton| [1, 5]|               [,,,]|       5|Fender|Stratocaster|            Electric|
|   2|Eric Clapton| [1, 5]|               [,,,]|       1|Gibson|          SG|            Electric|
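The row multiplication can be sketched with plain Scala collections (a minimal sketch, no Spark required; the object name and sample data are illustrative): a join on an array_contains-style condition matches each id in the array separately, so a single player row becomes one output row per matched guitar.

```scala
object ExplodeSketch {
  // One player row with an array of guitar ids, as in df1 (illustrative data).
  val player: (String, Seq[Int]) = ("Eric Clapton", Seq(1, 5))

  // (guitarId, make, model, type) rows, as in df2.
  val guitars: Seq[(Int, String, String, String)] = Seq(
    (1, "Gibson", "SG", "Electric"),
    (5, "Fender", "Stratocaster", "Electric")
  )

  // The condition array_contains(guitars, guitarId) matches each id
  // separately, so the single player row yields one row per matched guitar.
  val joined: Seq[(String, Seq[Int], (Int, String, String, String))] =
    guitars.collect {
      case g @ (id, _, _, _) if player._2.contains(id) =>
        (player._1, player._2, g)
    }
}
```

Two matched ids in the array produce two rows for the same player, which is exactly the duplication shown above.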

What would be the most elegant solution to transform this array column of Longs into an array column of structs from another dataframe?

Ideally

|   2|Eric Clapton|[[Fender, Stratocaster, Electric],[Gibson, SG, Electric]]|               [,,,]|

Thanks in advance

(first question btw, be humble :P)

Answer

array_contains() works; you only have to group the result by player afterwards.

Let's start with two datasets, one for the players and one for the guitars:

val player = Seq(
  ("Eric Clapton", Array(1, 5)),
  ("Paco de Lucia", Array(1, 2)),
  ("Jimi Hendrix", Array(3))
).toDF("player", "guitars")

val guitar = Seq(
  (1, "Gibson", "SG", "Electric"),
  (2, "Faustino Conde", "Media Luna", "Acoustic"),
  (3, "Pulsebeatguitars", "Spider", "Electric"),
  (4, "Yamaha", "FG800", "Acoustic"),
  (5, "Fender", "Stratocaster", "Electric")
).toDF("guitarId", "make", "model", "type")

+-------------+-------+
|       player|guitars|
+-------------+-------+
| Eric Clapton| [1, 5]|
|Paco de Lucia| [1, 2]|
| Jimi Hendrix|    [3]|
+-------------+-------+
+--------+----------------+------------+--------+
|guitarId|            make|       model|    type|
+--------+----------------+------------+--------+
|       1|          Gibson|          SG|Electric|
|       2|  Faustino Conde|  Media Luna|Acoustic|
|       3|Pulsebeatguitars|      Spider|Electric|
|       4|          Yamaha|       FG800|Acoustic|
|       5|          Fender|Stratocaster|Electric|
+--------+----------------+------------+--------+

To make the grouping operation a bit easier, the idea is to combine the three columns of the guitar dataset into a struct before the join:

val guitar2 = guitar.withColumn("guitar", struct('make, 'model, 'type))

After the join, we group the result by player and get the correct result:

player.join(guitar2, expr("array_contains(guitars, guitarId)"))
  .groupBy("player")
  .agg(collect_list('guitar))
  .show(false)

which prints

+-------------+----------------------------------------------------------------+
|player       |collect_list(guitar)                                            |
+-------------+----------------------------------------------------------------+
|Jimi Hendrix |[[Pulsebeatguitars, Spider, Electric]]                          |
|Eric Clapton |[[Gibson, SG, Electric], [Fender, Stratocaster, Electric]]      |
|Paco de Lucia|[[Gibson, SG, Electric], [Faustino Conde, Media Luna, Acoustic]]|
+-------------+----------------------------------------------------------------+
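The join-then-group semantics can also be sketched with plain Scala collections (a minimal sketch without Spark; the object name and the id-keyed lookup map are illustrative, not part of the answer's code): each player keeps the guitar structs whose ids appear in its array, which is what the array_contains join followed by groupBy + collect_list computes.

```scala
object GroupSketch {
  case class Guitar(guitarId: Int, make: String, model: String, kind: String)

  // Players with their guitar-id arrays, as in the player dataset above.
  val players: Seq[(String, Seq[Int])] = Seq(
    ("Eric Clapton", Seq(1, 5)),
    ("Paco de Lucia", Seq(1, 2)),
    ("Jimi Hendrix", Seq(3))
  )

  // The guitar dataset, keyed by id for lookup.
  val byId: Map[Int, Guitar] = Seq(
    Guitar(1, "Gibson", "SG", "Electric"),
    Guitar(2, "Faustino Conde", "Media Luna", "Acoustic"),
    Guitar(3, "Pulsebeatguitars", "Spider", "Electric"),
    Guitar(4, "Yamaha", "FG800", "Acoustic"),
    Guitar(5, "Fender", "Stratocaster", "Electric")
  ).map(g => g.guitarId -> g).toMap

  // Collection analogue of the array_contains join followed by
  // groupBy(player) + collect_list(guitar): each player's id array is
  // resolved to the matching guitar structs.
  val result: Map[String, Seq[Guitar]] =
    players.map { case (name, ids) => name -> ids.flatMap(byId.get) }.toMap
}
```

Note that in the Spark version the order inside each collected list is not guaranteed, whereas this sketch preserves the id order in the array.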

