dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions
Problem description
I would like to join two spark-scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column-name comparisons, as shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))
A solution for this already exists for the PySpark version, provided in the following link: PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same code using spark-scala.
Recommended answer
In Scala you can do it in a similar way to Python, but you need to use the map and reduce functions:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

// Sample data: each tuple becomes one row with two columns
val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")

val columnsdf1 = df1.columns
val columnsdf2 = df2.columns

// Pair up the columns, build one equality condition per pair,
// then AND them all together into a single join expression
val joinExprs = columnsdf1
  .zip(columnsdf2)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

val dfJoinRes = df1.join(df2, joinExprs)
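The zip/map/reduce pattern itself is plain Scala and works independently of Spark. As a minimal Spark-free sketch (using `Boolean` as a stand-in for Spark's `Column`, with hypothetical sample maps `left` and `right`), the same reduction looks like this:

```scala
// Two rows represented as column-name -> value maps (illustrative data only)
val left  = Map("col1" -> "a", "col2" -> "b")
val right = Map("col1" -> "a", "col2" -> "b")
val cols  = Seq("col1", "col2")

// One equality check per column, combined with reduce(_ && _),
// mirroring how the Column expressions are combined above
val matches = cols.map(c => left(c) == right(c)).reduce(_ && _)
println(matches) // true
```

Note that when both DataFrames share identical column names, Spark's `join(right, usingColumns: Seq[String])` overload achieves the same thing and additionally drops the duplicated join columns from the result, e.g. `df1.join(df2, df1.columns.toSeq)`.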