pyspark join multiple conditions


Problem description

How can I specify multiple conditions in PySpark when I use .join()?

Example: with Hive:

query= "select a.NUMCNT,b.NUMCNT as RNUMCNT ,a.POLE,b.POLE as RPOLE,a.ACTIVITE,b.ACTIVITE as RACTIVITE FROM rapexp201412 b \加入 rapexp201412 a where (a.NUMCNT=b.NUMCNT and a.ACTIVITE = b.ACTIVITE and a.POLE =b.POLE )\

But in PySpark I don't know how to write it, because the following:

df_rapexp201412.join(df_aeveh,df_rapexp2014.ACTIVITE==df_rapexp2014.ACTIVITE and df_rapexp2014.POLE==df_aeveh.POLE,'inner')

It does not work!!

Solution

Quoting from the Spark docs:

(https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join)


join(other, on=None, how=None) Joins with another DataFrame, using the given join expression.

The following performs a full outer join between df1 and df2.

Parameters: other – Right side of the join. on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join. how – str, default 'inner'. One of inner, outer, left_outer, right_outer, semijoin.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]

So you need to use the "condition as a list" option, as in the last example above.
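
Applied to the dataframes from the question (df_rapexp201412 and df_aeveh; the column names NUMCNT, ACTIVITE and POLE are taken from the Hive query, so treat this as an untested sketch rather than verified code), that would look like:

# Build the join conditions as a list of Column expressions,
# mirroring the predicates of the Hive query above.
cond = [df_rapexp201412.NUMCNT == df_aeveh.NUMCNT,      # a.NUMCNT = b.NUMCNT
        df_rapexp201412.ACTIVITE == df_aeveh.ACTIVITE,  # a.ACTIVITE = b.ACTIVITE
        df_rapexp201412.POLE == df_aeveh.POLE]          # a.POLE = b.POLE
df_joined = df_rapexp201412.join(df_aeveh, cond, 'inner')

Note that the attempt in the question fails because Python's and keyword cannot be overloaded for Column objects; chaining the conditions with & (each comparison wrapped in parentheses) would also work, but the list form is the cleaner option.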

