Join two DataFrames where the join key is different and only select some columns
Question
What I would like to do is:

Join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B.
I tried something like what I put below with different quotation marks, but it still doesn't work. I feel that in pyspark there should be a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write

A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")

to do this, but I would like to do it more like the pseudo code above.
Answer
Your pseudocode is basically correct. This slightly modified version would work if the id column existed in both DataFrames:
A_B = A.alias("A").join(B.alias("B"), on="id").select("A.*", "B.b1", "B.b2")

(The alias calls are needed so that the string references "A.*" and "B.b1" resolve inside select.)