根据列值加入 [英] joined based on the column value
问题描述
我正在使用 spark-sql-2.4.1v如何进行各种连接取决于列的值
I am using spark-sql-2.4.1v how to do various joins depend on the value of column
样本数据
val data = List(
("20", "score", "school", 14 ,12),
("21", "score", "school", 13 , 13),
("22", "rate", "school", 11 ,14)
)
val df = data.toDF("id", "code", "entity", "value1","value2")
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school| 14| 12|
| 21|score|school| 13| 13|
| 22| rate|school| 11| 14|
| 21| rate|school| 13| 12|
基于代码"我需要与其他各种表连接的列值
based the "code" column value i need to do join with various other tables
val rateDs = // val data1= List(
("22", 11 ,A),
("22", 14 ,B),
("20", 13 ,C),
("21", 12 ,C),
("21", 13 ,D)
)
val df = data1.toDF(id",map_code",map_val")
val df = data1.toDF("id", "map_code","map_val")
val scoreDs = // scoreTable
如果代码"列值为rate"我需要加入rateDs如果代码"列值为分数"我需要加入scoreDs
if the "code" column value is "rate" i need to join with rateDs if the "code" column value is "score" i need to join with scoreDs
如何在 spark 中处理这些事情?实现这一目标的最佳方法是什么?
how to handle these kind of things in spark ? any optimum way to achieve this?
rate"的预期结果字段
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 22| rate|school| A| B |
| 21| rate|school| D| C |
推荐答案
你可以简单地加入两次,例如
You can simply join twice, for example
val data = List(
("20", "score", "school", 14 , 12),
("21", "score", "school", 13 , 13),
("22", "rate", "school", 11 , 14),
("21", "rate", "school", 13 , 12)
)
val df = data.toDF("id", "code", "entity", "value1","value2")
val data1 = List(
("22", 11 ,"A"),
("22", 14 ,"B"),
("20", 13 ,"C"),
("21", 12 ,"C"),
("21", 13 ,"D")
)
val rateDF = data1.toDF("id", "map_code","map_val")
df.as("a")
.join(rateDF.as("b"),
col("a.code") === lit("rate")
&& col("a.id") === col("b.id")
&& col("a.value1") === col("b.map_code"), "inner")
.join(rateDF.as("c"),
col("a.code") === lit("rate")
&& col("a.id") === col("c.id")
&& col("a.value2") === col("c.map_code"), "inner")
.select(col("a.id"), col("a.code"), col("a.entity"), col("b.map_val").as("value1"), col("c.map_val").as("value2"))
.show(false)
+---+----+------+------+------+
|id |code|entity|value1|value2|
+---+----+------+------+------+
|22 |rate|school|A |B |
|21 |rate|school|D |C |
+---+----+------+------+------+
嗯,这看起来有点脏,但我不知道多列...
Well, this looks a bit dirty, but I have no idea for the multiple columns...
这篇关于根据列值加入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!