How to use join with many conditions in pyspark?

Views: 47 | Published: 2021/11/14 21:47:41 | Tags: python, apache-spark, spark-dataframe


Question

I am able to use the DataFrame join statement with a single on condition in pyspark, but if I try to add multiple conditions, it fails.

Code:

   summary2 = summary.join(county_prop, ["category_id", "bucket"], how = "leftouter")

The above code works. However, if I add another condition to the list, such as summary.bucket == 9, it fails. Please help me fix this issue.

The statement that fails:

   summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")

Error:

   TypeError: 'Column' object is not callable

Adding a complete working example.

   from pyspark.sql.types import StructType, StructField, StringType, IntegerType

   # The numeric columns are declared as IntegerType so that the schema matches
   # the integer values in the rows below.
   schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()),
                        StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()),
                        StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])
   bucket_summary = sqlContext.createDataFrame([], schema)

   temp_county_prop = sqlContext.createDataFrame([("nation","nation",1,222,444,555),("nation","state",2,222,444,555)], schema)
   bucket_summary = bucket_summary.unionAll(temp_county_prop)
   county_prop = sqlContext.createDataFrame([("nation","state",2,121,221,551)], schema)

I want to join on the category_id and bucket columns, and replace the values in bucket_summary with the values from county_prop.

   cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
   bucket_summary2 = bucket_summary.join(county_prop, cond, how = "leftouter")

1. It works if I write out the whole condition with column expressions, and it also works if I list column names like ["category_id", "bucket"].

2. But if I use a combination of both, like cond = ["bucket", bucket_summary.category_id == "state"], it does not work.

What is wrong with the second statement?

Recommended answer

For example:

   df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

But in your case, (summary.bucket)==9 should not appear as a join condition.

Update:

In the join condition you can use either a list of Column join expressions or a list of column names.
