Drop function not working after left outer join in pyspark


Problem Description

My pyspark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns, id and priority. I am creating my dataframes like this:

a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)

b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)

c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)

The schema of c_df comes out as DataFrame[id: int, priority: int, id: int, priority: int].

The drop function did not remove the column.
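
As an aside, dropping by the string name behaves differently here. A minimal sketch, assuming the same a_df and b_df as above (d_df is just an illustrative name):

# drop by string name matches every column called `priority`,
# so both copies are removed and only the two `id` columns remain
d_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop('priority')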

But if I try to do:

c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)

then the priority column for a_df gets dropped.

Not sure if there is a version-change issue or something else, but it feels very weird that the drop function behaves like this.

I know the workaround can be to remove the unwanted columns first and then do the join, as in the sketch below. But I'm still not sure why the drop function is not working.
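
For example, a minimal sketch of that workaround (b_slim is just an illustrative name):

# drop the extra column from b_df before joining,
# so only a_df's priority column survives
b_slim = b_df.drop('priority')
c_df = a_df.join(b_slim, a_df.id == b_slim.id, 'left')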

Thanks.

Recommended Answer

Duplicate column names with joins in pyspark lead to unpredictable behavior, and I've read that you should disambiguate the names before joining. See these Stack Overflow questions: Spark Dataframe distinguish columns with duplicated name and Pyspark Join and then column select is showing unexpected output. I'm sorry to say I can't find why pyspark doesn't work as you describe.

But the Databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

From the Databricks page:

If you perform a join in Spark and don't specify your join correctly you'll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don't have duplicated columns.

When you join, you can instead try either using an alias (that's typically what I use), or you can pass the join columns as a list or a str:

df = left.join(right, ["priority"]) 
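
Applied to your a_df and b_df, either approach avoids the duplicate-priority problem. A minimal sketch, assuming the dataframes from the question:

from pyspark.sql import functions as F

# alias approach: qualify column references with the alias,
# then select only the columns you want to keep
a = a_df.alias('a')
b = b_df.alias('b')
c_df = a.join(b, F.col('a.id') == F.col('b.id'), 'left').select('a.id', 'a.priority')

# list-of-names approach: the join column (id) appears only once
# in the result; non-join columns that share a name, such as
# priority, would still be duplicated and need dropping or aliasing
c_df = a_df.join(b_df, ['id'], 'left')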
