How to handle this in Spark
Question
I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have a scenario with finance data coming from a Kafka topic. The data (base dataset) contains companyId, year, and prev_year fields.
If year === prev_year, then I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, then I need to return the base dataset itself.
How do I do this in spark-sql?
Answer
You can use the following approach for your case: split the base dataset on the year/prev_year condition, enrich only the non-matching half via a left join with exchange_rates, then union the two halves back together.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2016| 2017| 12|
| 1|2017| 2017|21.4|
| 2|2018| 2017|11.7|
| 2|2018| 2018|44.6|
| 3|2016| 2017|34.5|
| 4|2017| 2017| 56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
| 1|12.3|
| 2|12.5|
| 3|22.3|
| 4|34.6|
| 5|45.2|
+---------+----+
scala> import org.apache.spark.sql.functions.col
// rows where year == prev_year already carry the correct rate
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
// rows where they differ: drop the stale rate and pick it up from exchange_rates
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.drop("rate").join(exch_rates, List("companyId"), "left")
// both halves now share the schema (companyId, year, prev_year, rate), so union them
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2017| 2017|21.4|
| 2|2018| 2018|44.6|
| 4|2017| 2017| 56|
| 1|2016| 2017|12.3|
| 2|2018| 2017|12.5|
| 3|2016| 2017|22.3|
+---------+----+---------+----+
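To make the row-level rule explicit, here is a plain-Scala sketch of the same conditional enrichment over the sample data above (no Spark required; the `Row` case class, the `exchRates` map, and the `resolve` helper are illustrative stand-ins for the two tables, not part of the original answer):

```scala
// Simplified stand-in for one row of the base dataset (illustrative name).
case class Row(companyId: Int, year: Int, prevYear: Int, rate: Double)

object ConditionalRate {
  // Stand-in for the exchange_rates table, keyed by companyId.
  val exchRates: Map[Int, Double] =
    Map(1 -> 12.3, 2 -> 12.5, 3 -> 22.3, 4 -> 34.6, 5 -> 45.2)

  // year == prevYear: keep the base row's own rate.
  // year != prevYear: replace rate via a lookup; a missing key yields NaN,
  // mirroring the null a Spark left join would produce.
  def resolve(base: Seq[Row]): Seq[Row] =
    base.map { r =>
      if (r.year == r.prevYear) r
      else exchRates.get(r.companyId)
             .map(x => r.copy(rate = x))
             .getOrElse(r.copy(rate = Double.NaN))
    }

  def main(args: Array[String]): Unit = {
    val base = Seq(
      Row(1, 2016, 2017, 12.0), Row(1, 2017, 2017, 21.4),
      Row(2, 2018, 2017, 11.7), Row(2, 2018, 2018, 44.6)
    )
    resolve(base).foreach(println)
  }
}
```

The Spark version does the same thing set-wise instead of row-wise: the filter/join/union pipeline is just this per-row decision expressed over whole DataFrames.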