PySpark DataFrame - Join on multiple columns dynamically
Problem Description
Let's say I have two DataFrames on Spark:
firstdf = sqlContext.createDataFrame([{'firstdf-id':1,'firstdf-column1':2,'firstdf-column2':3,'firstdf-column3':4}, \
                                      {'firstdf-id':2,'firstdf-column1':3,'firstdf-column2':4,'firstdf-column3':5}])

seconddf = sqlContext.createDataFrame([{'seconddf-id':1,'seconddf-column1':2,'seconddf-column2':4,'seconddf-column3':5}, \
                                       {'seconddf-id':2,'seconddf-column1':6,'seconddf-column2':7,'seconddf-column3':8}])
Now I want to join them on multiple columns (any number greater than one).
What I have is an array of columns of the first DataFrame and an array of columns of the second DataFrame. These arrays have the same size, and I want to join on the columns specified in these arrays. For example:
columnsFirstDf = ['firstdf-id', 'firstdf-column1']
columnsSecondDf = ['seconddf-id', 'seconddf-column1']
Since these arrays have variable sizes, I can't use this kind of approach:
from pyspark.sql.functions import *
firstdf.join(seconddf, \
             (col(columnsFirstDf[0]) == col(columnsSecondDf[0])) &
             (col(columnsFirstDf[1]) == col(columnsSecondDf[1])), \
             'inner'
)
Is there any way to join on multiple columns dynamically?
Recommended Answer
Why not use a simple comprehension:
firstdf.join(
    seconddf,
    [col(f) == col(s) for (f, s) in zip(columnsFirstDf, columnsSecondDf)],
    "inner"
)
Since the conditions are combined with a logical AND, it is enough to pass a list of conditions; Spark applies the & operator between them implicitly.
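To make that equivalence concrete, here is a minimal, runnable sketch (assuming a local SparkSession rather than the question's sqlContext; the data and column lists mirror the example above). It shows the list-of-conditions form next to an explicit fold of the same conditions with & via functools.reduce; both produce the same join.

from functools import reduce
from operator import and_

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Same data as in the question, built with explicit schemas for readability.
firstdf = spark.createDataFrame(
    [(1, 2, 3, 4), (2, 3, 4, 5)],
    ["firstdf-id", "firstdf-column1", "firstdf-column2", "firstdf-column3"],
)
seconddf = spark.createDataFrame(
    [(1, 2, 4, 5), (2, 6, 7, 8)],
    ["seconddf-id", "seconddf-column1", "seconddf-column2", "seconddf-column3"],
)

columnsFirstDf = ["firstdf-id", "firstdf-column1"]
columnsSecondDf = ["seconddf-id", "seconddf-column1"]

# Variant 1: pass a list of equality conditions; Spark AND-s them together.
joined_list = firstdf.join(
    seconddf,
    [col(f) == col(s) for f, s in zip(columnsFirstDf, columnsSecondDf)],
    "inner",
)

# Variant 2: fold the same conditions into a single Column with an explicit &.
condition = reduce(
    and_,
    [col(f) == col(s) for f, s in zip(columnsFirstDf, columnsSecondDf)],
)
joined_reduce = firstdf.join(seconddf, condition, "inner")

# Both variants return the same rows.
joined_list.show()
joined_reduce.show()

Passing the list is the more readable option; the reduce variant is useful when you need the combined condition as a single Column, for example to mix in additional predicates.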