Dataframe lookup and optimization
Problem description
I am using spark-sql-2.4.3v with Java. I have the scenario below:
val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate",  "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate",  "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
df.show
// this lookup data is populated from the DB
val ll = List(
  ("aaaa", 11),
  ("aaa", 12),
  ("aa", 13),
  ("a", 14)
)
val codeValudeDf = ll.toDF("code", "value")
codeValudeDf.show
I need to map "code" to "value" in the final output, but only for those rows/records that have "code" equal to "score" in the "data" dataframe.
How can I build a lookup hashmap from codeValudeDf so that I get output as below?
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school|     a|   aaa|
| 21|score|school|    aa|    aa|
| 22| rate|school|    11|    14|
| 23|score|school|  aaaa|     a|
| 24| rate|school|    12|    12|
| 25|score|school|  aaaa|     a|
+---+-----+------+------+------+
Is there any way to make this lookup optimal, i.e. so that I do not have to pull the lookup data from the DB every time?
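Stripped of Spark, the requested transformation per row is: when `code` is `"score"`, replace `value1`/`value2` with the string that `codeValudeDf` maps that number to (value → code); otherwise keep the number as a string. A minimal plain-Scala sketch of that rule, with the lookup map hard-coded from the sample data above:

```scala
// Plain-Scala sketch of the intended per-row lookup, independent of Spark.
object LookupSketch {
  // value -> code, built from the (code, value) pairs in codeValudeDf
  val lookup: Map[Int, String] =
    Map(11 -> "aaaa", 12 -> "aaa", 13 -> "aa", 14 -> "a")

  // Replace a value only for "score" rows; otherwise keep it as a string.
  def translate(code: String, value: Int): String =
    if (code == "score") lookup.getOrElse(value, value.toString)
    else value.toString

  def main(args: Array[String]): Unit = {
    println(translate("score", 14)) // a
    println(translate("rate", 11))  // 11
  }
}
```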
Answer
If the lookup data is small, you can create a Map and broadcast it. The broadcast map can then be used easily in a UDF, as below:
val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate",  "school", 11, 14),
  ("23", "score", "school", 11, 14),
  ("24", "rate",  "school", 12, 12),
  ("25", "score", "school", 11, 14)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
df.show
/**
* +---+-----+------+------+------+
* | id| code|entity|value1|value2|
* +---+-----+------+------+------+
* | 20|score|school| 14| 12|
* | 21|score|school| 13| 13|
* | 22| rate|school| 11| 14|
* | 23|score|school| 11| 14|
* | 24| rate|school| 12| 12|
* | 25|score|school| 11| 14|
* +---+-----+------+------+------+
*/
// this lookup data is populated from the DB
val ll = List(
  ("aaaa", 11),
  ("aaa", 12),
  ("aa", 13),
  ("a", 14)
)
val codeValudeDf = ll.toDF("code", "value")
codeValudeDf.show
/**
* +----+-----+
* |code|value|
* +----+-----+
* |aaaa| 11|
* | aaa| 12|
* | aa| 13|
* | a| 14|
* +----+-----+
*/
The broadcast map can be used easily in a UDF, as below:
// collect the small lookup table to the driver as a value -> code map,
// then broadcast it to all executors
val lookUp = spark.sparkContext.broadcast(
  codeValudeDf.map { case Row(code: String, value: Integer) => value -> code }
    .collect().toMap
)

val look_up = udf((value: Integer) => lookUp.value.get(value))

df.withColumn("value1",
    when($"code" === "score", look_up($"value1")).otherwise($"value1".cast("string")))
  .withColumn("value2",
    when($"code" === "score", look_up($"value2")).otherwise($"value2".cast("string")))
  .show(false)
/**
* +---+-----+------+------+------+
* |id |code |entity|value1|value2|
* +---+-----+------+------+------+
* |20 |score|school|a |aaa |
* |21 |score|school|aa |aa |
* |22 |rate |school|11 |14 |
* |23 |score|school|aaaa |a |
* |24 |rate |school|12 |12 |
* |25 |score|school|aaaa |a |
* +---+-----+------+------+------+
*/
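As for not pulling the lookup data from the DB every time: load it once per application and reuse the broadcast across jobs. One common pattern is to memoize the loaded map behind a `lazy val`, so the DB query runs at most once per JVM. A sketch, where `loadFromDb` is a hypothetical stand-in for the real query that fills codeValudeDf:

```scala
// Sketch: memoize the DB lookup so it is fetched once per application.
// `loadFromDb` is a stand-in for the real DB query; here it returns the
// sample pairs and counts its invocations to demonstrate single loading.
object LookupCache {
  var dbCalls = 0 // counter only to show the load happens once

  private def loadFromDb(): Map[Int, String] = {
    dbCalls += 1
    Map(11 -> "aaaa", 12 -> "aaa", 13 -> "aa", 14 -> "a")
  }

  // lazy val: evaluated on first access, then cached for the JVM's lifetime
  lazy val lookup: Map[Int, String] = loadFromDb()
}
```

In the Spark job you would broadcast `LookupCache.lookup` instead of re-collecting `codeValudeDf` on every run; alternatively, `codeValudeDf.cache()` keeps the dataframe itself in memory within one application.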