How to use join with many conditions in pyspark?
Question
I am able to use the DataFrame join statement with a single on condition (in pyspark), but if I try to add multiple conditions, it fails.
Code:
summary2 = summary.join(county_prop, ["category_id", "bucket"], how="leftouter")
The above code works. However, if I add another condition to the list, such as summary.bucket == 9, it fails. Please help me fix this issue.
This statement raises the error:
summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")
ERROR : TypeError: 'Column' object is not callable
Adding a complete working example.
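from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Numeric columns use IntegerType so the schema matches the integer values in the rows below.
schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()), StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])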
bucket_summary = sqlContext.createDataFrame([],schema)
temp_county_prop = sqlContext.createDataFrame([("nation","nation",1,222,444,555),("nation","state",2,222,444,555)],schema)
bucket_summary = bucket_summary.unionAll(temp_county_prop)
county_prop = sqlContext.createDataFrame([("nation","state",2,121,221,551)],schema)
I want to join on the category_id and bucket columns, so that the values from county_prop replace those in bucket_summary.
cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
bucket_summary2 = bucket_summary.join(county_prop,cond,how ="leftouter")
1. It works if I write out the whole condition with column expressions, and it also works if I list column names such as ["category_id", "bucket"].
2. But if I combine both, as in cond = ["bucket", bucket_summary.category_id == "state"], it does not work. What is wrong with statement 2?
Recommended answer
For example:
df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')
But in your case, (summary.bucket) == 9 should not appear as a join condition.
Update:
In the join condition you can use a list of Column join expressions or a list of column names.