PySpark DataFrame - Join on multiple columns dynamically


Problem description

Let's say I have two DataFrames on Spark:

firstdf = sqlContext.createDataFrame([
    {'firstdf-id': 1, 'firstdf-column1': 2, 'firstdf-column2': 3, 'firstdf-column3': 4},
    {'firstdf-id': 2, 'firstdf-column1': 3, 'firstdf-column2': 4, 'firstdf-column3': 5},
])

seconddf = sqlContext.createDataFrame([
    {'seconddf-id': 1, 'seconddf-column1': 2, 'seconddf-column2': 4, 'seconddf-column3': 5},
    {'seconddf-id': 2, 'seconddf-column1': 6, 'seconddf-column2': 7, 'seconddf-column3': 8},
])
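(Note: sqlContext is the Spark 1.x entry point. On Spark 2.0 and later the same two DataFrames can be built from a SparkSession. A minimal sketch, assuming a standard PySpark install; the app name is just a placeholder:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-join-example").getOrCreate()

# Same sample data as above, with explicit column names instead of dicts.
firstdf = spark.createDataFrame(
    [(1, 2, 3, 4), (2, 3, 4, 5)],
    ['firstdf-id', 'firstdf-column1', 'firstdf-column2', 'firstdf-column3'],
)

seconddf = spark.createDataFrame(
    [(1, 2, 4, 5), (2, 6, 7, 8)],
    ['seconddf-id', 'seconddf-column1', 'seconddf-column2', 'seconddf-column3'],
)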

Now I want to join them on multiple columns (any number greater than one).

What I have is an array of columns of the first DataFrame and an array of columns of the second DataFrame. These arrays have the same size, and I want to join on the columns specified in them. For example:

columnsFirstDf = ['firstdf-id', 'firstdf-column1']
columnsSecondDf = ['seconddf-id', 'seconddf-column1']

Since these arrays have variable sizes, I can't use this kind of hardcoded approach:

from pyspark.sql.functions import col

firstdf.join(
    seconddf,
    (col(columnsFirstDf[0]) == col(columnsSecondDf[0])) &
    (col(columnsFirstDf[1]) == col(columnsSecondDf[1])),
    'inner'
)

Is there any way that I can join on multiple columns dynamically?

Recommended answer

Why not use a simple comprehension:

firstdf.join(
    seconddf,
    [col(f) == col(s) for (f, s) in zip(columnsFirstDf, columnsSecondDf)],
    "inner"
)

Since the conditions in the list are combined with a logical AND, it is enough to provide a list of conditions without the & operator.
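Putting it together, here is a runnable sketch of the dynamic join on the sample data above (assuming firstdf, seconddf, and a live Spark session are in scope; the conditions name is just illustrative):

from functools import reduce

from pyspark.sql.functions import col

columnsFirstDf = ['firstdf-id', 'firstdf-column1']
columnsSecondDf = ['seconddf-id', 'seconddf-column1']

# One equality condition per pair of columns; join() ANDs the list together.
conditions = [col(f) == col(s) for f, s in zip(columnsFirstDf, columnsSecondDf)]

firstdf.join(seconddf, conditions, 'inner').show()
# Only the first rows match: 1 == 1 on the ids and 2 == 2 on the first columns.

If you prefer building a single explicit condition, the same list can be folded into one Column with the & operator, which expresses the same logical AND:

combined = reduce(lambda a, b: a & b, conditions)
firstdf.join(seconddf, combined, 'inner').show()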

