Subset dataframe based on matching values in another dataframe Pyspark 1.6.1
Problem description
I have two dataframes. The first dataframe contains just one column, business_contact_nr, which is a set of client numbers.
| business_contact_nr |
| --- |
| 34567 |
| 45678 |
The second dataframe contains multiple columns: bc, which contains client numbers, and other columns with information about these clients.
| bc | gender | savings | month |
| --- | --- | --- | --- |
| 34567 | 1 | 100 | 200512 |
| 34567 | 1 | 200 | 200601 |
| 45678 | 0 | 500 | 200512 |
| 45678 | 0 | 500 | 200601 |
| 01234 | 1 | 60 | 200512 |
| 01234 | 1 | 150 | 200601 |
What I would like to do is subset the second dataframe based on whether the client numbers in it match the ones in the first dataframe.
So all client numbers that are not also in the first dataframe should be deleted; in this case, all rows where bc = 01234.
I am working with Pyspark 1.6.1. Any idea on how to do this?
Answer
This can be solved with a join. Assume df1 is your first dataframe and df2 is your second. You can first rename df1.business_contact_nr so the key columns share a name, then join:
```python
# Rename the key column so both dataframes agree on the join key
df1 = df1.withColumnRenamed('business_contact_nr', 'bc')
# An inner join on 'bc' keeps only rows of df2 whose client number is in df1
df2subset = df2.join(df1, on='bc')
```

Since df1 holds a set of client numbers (no duplicates), the inner join does not multiply rows. If that is not guaranteed, apply df1.distinct() first. Alternatively, a leftsemi join (listed among the join types in the Spark 1.6 DataFrame API) returns only df2's columns and avoids the rename: `df2.join(df1, df2.bc == df1.business_contact_nr, 'leftsemi')`.
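To make the semantics of this join concrete without a running Spark cluster, here is a plain-Python sketch of the same filter on the sample data from the question. It keeps only rows of the second table whose client number appears in the first, as the join above does; the set-based lookup mirrors how a broadcast hash join of a small dataframe works.

```python
# Values of business_contact_nr from the first dataframe
df1 = ["34567", "45678"]

# Rows of the second dataframe, as dicts keyed by column name
df2 = [
    {"bc": "34567", "gender": 1, "savings": 100, "month": 200512},
    {"bc": "34567", "gender": 1, "savings": 200, "month": 200601},
    {"bc": "45678", "gender": 0, "savings": 500, "month": 200512},
    {"bc": "45678", "gender": 0, "savings": 500, "month": 200601},
    {"bc": "01234", "gender": 1, "savings": 60,  "month": 200512},
    {"bc": "01234", "gender": 1, "savings": 150, "month": 200601},
]

# Build a lookup set of wanted client numbers, then keep matching rows only
wanted = set(df1)
df2subset = [row for row in df2 if row["bc"] in wanted]

print(len(df2subset))                     # 4 rows survive
print(sorted({r["bc"] for r in df2subset}))  # ['34567', '45678']
```

All rows with bc = 01234 are dropped, matching the result of the PySpark join.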