根据列值联接 [英] joined based on the column value
问题描述
我正在使用spark-sql-2.4.1v 如何进行各种连接取决于column的值
I am using spark-sql-2.4.1v how to do various joins depend on the value of column
样本数据
val data = List(
("20", "score", "school", 14 ,12),
("21", "score", "school", 13 , 13),
("22", "rate", "school", 11 ,14)
)
val df = data.toDF("id", "code", "entity", "value1","value2")
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school| 14| 12|
| 21|score|school| 13| 13|
| 22| rate|school| 11| 14|
| 21| rate|school| 13| 12|
基于代码"列值,我需要与其他各种表联接
based the "code" column value i need to do join with various other tables
val rateDs = // val data1= List(
("22", 11 ,A),
("22", 14 ,B),
("20", 13 ,C),
("21", 12 ,C),
("21", 13 ,D)
)
val df = data1.toDF("id","map_code","map_val")
val df = data1.toDF("id", "map_code","map_val")
val scoreDs = // scoreTable
如果代码"列值是"rate".我需要加入rateDs 如果代码"是列值是分数"我需要加入scoreDs
if the "code" column value is "rate" i need to join with rateDs if the "code" column value is "score" i need to join with scoreDs
如何处理这类事情?有什么最佳方法可以实现这一目标?
how to handle these kind of things in spark ? any optimum way to achieve this?
预期的费率"结果字段
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 22| rate|school| A| B |
| 21| rate|school| D| C |
推荐答案
例如,您可以简单地加入两次
You can simply join twice, for example
val data = List(
("20", "score", "school", 14 , 12),
("21", "score", "school", 13 , 13),
("22", "rate", "school", 11 , 14),
("21", "rate", "school", 13 , 12)
)
val df = data.toDF("id", "code", "entity", "value1","value2")
val data1 = List(
("22", 11 ,"A"),
("22", 14 ,"B"),
("20", 13 ,"C"),
("21", 12 ,"C"),
("21", 13 ,"D")
)
val rateDF = data1.toDF("id", "map_code","map_val")
df.as("a")
.join(rateDF.as("b"),
col("a.code") === lit("rate")
&& col("a.id") === col("b.id")
&& col("a.value1") === col("b.map_code"), "inner")
.join(rateDF.as("c"),
col("a.code") === lit("rate")
&& col("a.id") === col("c.id")
&& col("a.value2") === col("c.map_code"), "inner")
.select(col("a.id"), col("a.code"), col("a.entity"), col("b.map_val").as("value1"), col("c.map_val").as("value2"))
.show(false)
+---+----+------+------+------+
|id |code|entity|value1|value2|
+---+----+------+------+------+
|22 |rate|school|A |B |
|21 |rate|school|D |C |
+---+----+------+------+------+
好吧,这看起来有点脏,但是我不知道多列...
Well, this looks a bit dirty, but I have no idea for the multiple columns...
这篇关于根据列值联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!