SPARK - Joining 2 dataframes on values in an array
Question
I can't find a simple and elegant solution to this one.
I have a df1 with this column:
|-- guitars: array (nullable = true)
| |-- element: long (containsNull = true)
I have a df2 of guitars, whose id matches the Longs in my df1:
root
|-- guitarId: long (nullable = true)
|-- make: string (nullable = true)
|-- model: string (nullable = true)
|-- type: string (nullable = true)
I want to join my two dfs so that, instead of an array of Longs, I get an array of guitar structs from df2.
I'm using array_contains() to join the two dfs, but Spark explodes the array of n Longs in df1 into n rows in the result df.
Before:
| 2|Eric Clapton| [1, 5]| [,,,]|
After:
| 2|Eric Clapton| [1, 5]| [,,,]| 5|Fender|Stratocaster| Electric|
| 2|Eric Clapton| [1, 5]| [,,,]| 1|Gibson| SG| Electric|
What would be the most elegant way to transform this array column of Longs into an array column of structs from another dataframe?
Ideally:
| 2|Eric Clapton|[[Fender, Stratocaster, Electric],[Gibson, SG, Electric]]| [,,,]|
Thanks in advance!
(First question btw, be humble :P)
Answer
array_contains() works; you just need to group the result by player afterwards.
Let's start with two datasets, one for the players and one for the guitars:
import org.apache.spark.sql.functions._  // struct, expr, collect_list, udf
import spark.implicits._                 // toDF and the 'col symbol syntax (pre-imported in spark-shell)

val player = Seq(
  ("Eric Clapton", Array(1, 5)),
  ("Paco de Lucia", Array(1, 2)),
  ("Jimi Hendrix", Array(3))
).toDF("player", "guitars")

val guitar = Seq(
  (1, "Gibson", "SG", "Electric"),
  (2, "Faustino Conde", "Media Luna", "Acoustic"),
  (3, "Pulsebeatguitars", "Spider", "Electric"),
  (4, "Yamaha", "FG800", "Acoustic"),
  (5, "Fender", "Stratocaster", "Electric")
).toDF("guitarId", "make", "model", "type")
+-------------+-------+
| player|guitars|
+-------------+-------+
| Eric Clapton| [1, 5]|
|Paco de Lucia| [1, 2]|
| Jimi Hendrix| [3]|
+-------------+-------+
+--------+----------------+------------+--------+
|guitarId| make| model| type|
+--------+----------------+------------+--------+
| 1| Gibson| SG|Electric|
| 2| Faustino Conde| Media Luna|Acoustic|
| 3|Pulsebeatguitars| Spider|Electric|
| 4| Yamaha| FG800|Acoustic|
| 5| Fender|Stratocaster|Electric|
+--------+----------------+------------+--------+
To make the grouping operation a bit easier, the idea is to combine the three columns of the guitar dataset into a struct before the join:
val guitar2 = guitar.withColumn("guitar", struct('make, 'model, 'type))
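For reference, the new guitar column packs the three attributes into one nested field, which is what lets collect_list later gather them as whole records. A quick way to check is printSchema (the exact nullable flags may differ from this sketch):

```scala
// Inspect the combined struct column on guitar2.
guitar2.select('guitarId, 'guitar).printSchema()
// roughly:
// root
//  |-- guitarId: integer
//  |-- guitar: struct
//  |    |-- make: string
//  |    |-- model: string
//  |    |-- type: string
```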
After the join, we group the result by player and get the desired output:
player.join(guitar2, expr("array_contains(guitars, guitarId)"))
.groupBy("player")
.agg(collect_list('guitar))
.show(false)
This prints:
+-------------+----------------------------------------------------------------+
|player |collect_list(guitar) |
+-------------+----------------------------------------------------------------+
|Jimi Hendrix |[[Pulsebeatguitars, Spider, Electric]] |
|Eric Clapton |[[Gibson, SG, Electric], [Fender, Stratocaster, Electric]] |
|Paco de Lucia|[[Gibson, SG, Electric], [Faustino Conde, Media Luna, Acoustic]]|
+-------------+----------------------------------------------------------------+
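As a hedged alternative, in case the explode-and-regroup round trip feels heavy: if the guitar table is small enough to collect to the driver, the ids can be resolved with an in-memory lookup map and a UDF, so no join is needed at all. This is only a sketch under that small-table assumption; note the resolved struct fields get the default tuple names _1/_2/_3 unless renamed:

```scala
import org.apache.spark.sql.functions.udf

// Assumption: guitar fits in driver memory. Build an id -> attributes map.
val byId: Map[Int, (String, String, String)] = guitar.collect().map { r =>
  r.getInt(0) -> (r.getString(1), r.getString(2), r.getString(3))
}.toMap

// Resolve each id in the array; unknown ids are silently dropped.
val resolve = udf((ids: Seq[Int]) => ids.flatMap(byId.get))

player.withColumn("guitars", resolve('guitars)).show(false)
```

Note also that the inner join in the main answer drops players whose array matches no guitarId; passing "left" as the join type to player.join(...) keeps them, and collect_list then yields an empty array for those rows since it skips nulls.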