Subset dataframe based on matching values in another dataframe Pyspark 1.6.1
Question
I have two dataframes. The first dataframe contains just one column, business_contact_nr, which is a set of client numbers.
| business_contact_nr |
| --- |
| 34567 |
| 45678 |
The second dataframe contains multiple columns: bc, which contains client numbers, and other columns with information about these clients.
| bc | gender | savings | month |
| --- | --- | --- | --- |
| 34567 | 1 | 100 | 200512 |
| 34567 | 1 | 200 | 200601 |
| 45678 | 0 | 500 | 200512 |
| 45678 | 0 | 500 | 200601 |
| 01234 | 1 | 60 | 200512 |
| 01234 | 1 | 150 | 200601 |
What I would like to do is subset the second dataframe based on whether the client numbers in it match the ones in the first dataframe.
So all client numbers that are not also in the first dataframe should be dropped; in this case, all rows where bc = 01234.
I am working with Pyspark 1.6.1. Any idea on how to do this?
Answer
This can be solved with a join. Assume df1 is your first dataframe and df2 is your second dataframe. Then you can first rename df1.business_contact_nr and join:
df1 = df1.withColumnRenamed('business_contact_nr', 'bc')
df2subset = df2.join(df1, on='bc')  # default inner join keeps only matching bc values