加入查找数据集后进行多列值查找 [英] Doing multiple column value look up after joining with lookup dataset
本文介绍了加入查找数据集后进行多列值查找的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我正在使用 spark-sql-2.4.1v 如何根据列的值进行各种连接我需要为给定的值列获取 map_val
列的多个查找值,如下所示.
I am using spark-sql-2.4.1v how to do various joins depend on the value of column
I need get multiple look up values of map_val
column for given value columns as show below.
示例数据:
val data = List(
("20", "score", "school", "2018-03-31", 14 , 12),
("21", "score", "school", "2018-03-31", 13 , 13),
("22", "rate", "school", "2018-03-31", 11 , 14),
("21", "rate", "school", "2018-03-31", 13 , 12)
)
val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
df.show
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| 11| 14|
| 21| rate|school|2018-03-31| 13| 12|
+---+-----+------+----------+------+------+
查找数据集rateDs
:
val rateDs = List(
("21","2018-01-31","2018-06-31", 12 ,"C"),
("21","2018-01-31","2018-06-31", 13 ,"D")
).toDF("id","start_date","end_date", "map_code","map_val")
rateDs.show
+---+----------+----------+--------+-------+
| id|start_date| end_date|map_code|map_val|
+---+----------+----------+--------+-------+
| 21|2018-01-31|2018-06-31| 12| C|
| 21|2018-01-31|2018-06-31| 13| D|
+---+----------+----------+--------+-------+
基于start_date
和end_date
加入map_val
列的查找表:
Joining with lookup table for map_val
column based on start_date
and end_date
:
val resultDs = df.filter(col("code").equalTo(lit("rate"))).join(rateDs ,
(
df.col("date").between(rateDs.col("start_date"), rateDs.col("end_date"))
.and(rateDs.col("id").equalTo(df.col("id")))
//.and(rateDs.col("mapping_value").equalTo(df.col("mean")))
)
, "left"
)
//.drop("start_date")
//.drop("end_date")
resultDs.show
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| id|code|entity| date|value1|value2| id|start_date| end_date|map_code|map_val|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| 21|rate|school|2018-03-31| 13| 12| 21|2018-01-31|2018-06-31| 13| D|
| 21|rate|school|2018-03-31| 13| 12| 21|2018-01-31|2018-06-31| 12| C|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
预期的输出应该是:
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| id|code|entity| date|value1|value2| id|start_date| end_date|map_code|map_val|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| 21|rate|school|2018-03-31| D | C | 21|2018-01-31|2018-06-31| 13| D|
| 21|rate|school|2018-03-31| D | C | 21|2018-01-31|2018-06-31| 12| C|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
如果需要更多详细信息,请告诉我.
Please let me know if any more details are needed.
推荐答案
试试这个-
在每个 id 加入之前创建查找图并使用相同的来替换
val newRateDS = rateDs.withColumn("lookUpMap",
map_from_entries(collect_list(struct(col("map_code"), col("map_val"))).over(Window.partitionBy("id")))
)
newRateDS.show(false)
/**
* +---+----------+----------+--------+-------+------------------+
* |id |start_date|end_date |map_code|map_val|lookUpMap |
* +---+----------+----------+--------+-------+------------------+
* |21 |2018-01-31|2018-06-31|12 |C |[12 -> C, 13 -> D]|
* |21 |2018-01-31|2018-06-31|13 |D |[12 -> C, 13 -> D]|
* +---+----------+----------+--------+-------+------------------+
*/
val resultDs = df.filter(col("code").equalTo(lit("rate"))).join(broadcast(newRateDS) ,
rateDs("id") === df("id") && df("date").between(rateDs("start_date"), rateDs("end_date"))
//.and(rateDs.col("mapping_value").equalTo(df.col("mean")))
, "left"
)
resultDs.withColumn("value1", expr("coalesce(lookUpMap[value1], value1)"))
.withColumn("value2", expr("coalesce(lookUpMap[value2], value2)"))
.show(false)
/**
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
* |id |code|entity|date |value1|value2|id |start_date|end_date |map_code|map_val|lookUpMap |
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
* |22 |rate|school|2018-03-31|11 |14 |null|null |null |null |null |null |
* |21 |rate|school|2018-03-31|D |C |21 |2018-01-31|2018-06-31|13 |D |[12 -> C, 13 -> D]|
* |21 |rate|school|2018-03-31|D |C |21 |2018-01-31|2018-06-31|12 |C |[12 -> C, 13 -> D]|
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
*/
这篇关于加入查找数据集后进行多列值查找的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文