pyspark join multiple conditions
Problem description
I want to ask if you have any idea how I can specify many conditions in pyspark when I use .join().
Example with Hive:

query = "select a.NUMCNT, b.NUMCNT as RNUMCNT, a.POLE, b.POLE as RPOLE, a.ACTIVITE, b.ACTIVITE as RACTIVITE \
         from rapexp201412 b \
         join rapexp201412 a \
         where (a.NUMCNT = b.NUMCNT and a.ACTIVITE = b.ACTIVITE and a.POLE = b.POLE)"
But in pyspark I don't know how to do it, because the following:
df_rapexp201412.join(df_aeveh,df_rapexp2014.ACTIVITE==df_rapexp2014.ACTIVITE and df_rapexp2014.POLE==df_aeveh.POLE,'inner')
does not work!
Recommended answer
Quoting from the Spark docs:
join(other, on=None, how=None) Joins with another DataFrame, using the given join expression.
The following performs a full outer join between df1 and df2.
Parameters: other – Right side of the join. on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join. how – str, default 'inner'. One of inner, outer, left_outer, right_outer, semijoin.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
So you need to use the "condition as a list" option, as in the last example.