Subset dataframe based on matching values in another dataframe Pyspark 1.6.1
Question
I have two dataframes. The first dataframe contains just one column, business_contact_nr, which is a set of client numbers.
| business_contact_nr |
| --- |
| 34567 |
| 45678 |
The second dataframe contains multiple columns: bc, which contains client numbers, and other columns with information about these clients.
| bc | gender | savings | month |
| --- | --- | --- | --- |
| 34567 | 1 | 100 | 200512 |
| 34567 | 1 | 200 | 200601 |
| 45678 | 0 | 500 | 200512 |
| 45678 | 0 | 500 | 200601 |
| 01234 | 1 | 60 | 200512 |
| 01234 | 1 | 150 | 200601 |
What I would like to do is subset the second dataframe based on whether the client numbers in it match the ones in the first dataframe.
So all client numbers that are not also in the first dataframe should be dropped; in this case, all rows where bc = 01234.
I am working with Pyspark 1.6.1. Any idea on how to do this?
Answer
This can be solved with a join. Assume df1 is your first dataframe and df2 is your second dataframe. Then you can first rename df1.business_contact_nr and join:
df1 = df1.withColumnRenamed('business_contact_nr', 'bc')
df2subset = df2.join(df1, on='bc')  # default inner join keeps only matching bc values