Drop function not working after left outer join in pyspark

Question

My pyspark version is 2.1.1. I am trying to join two dataframes (left outer join), each having two columns, id and priority. I am creating my dataframes like this:

a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)

b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)

c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)

The schema of c_df comes out as DataFrame[id: int, priority: int, id: int, priority: int].

The drop function is not removing the column.

But if I try this:

c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)

Then the priority column from a_df gets dropped.

Not sure if this is a version issue or something else, but it feels very weird that the drop function would behave like this.

I know the workaround can be to remove the unwanted columns first and then do the join, as sketched below. But I am still not sure why the drop function is not working.
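
For reference, that workaround might look like the following. This is a minimal sketch reusing a_df and b_df from above; b_slim is a name introduced here purely for illustration:

# Drop the unwanted priority column from b_df before joining,
# so only a_df's priority survives the join.
b_slim = b_df.drop('priority')
c_df = a_df.join(b_slim, a_df.id == b_slim.id, 'left')
# c_df schema: DataFrame[id: int, priority: int, id: int] -- the id columns are still duplicated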

Thanks in advance.

Answer

Duplicate column names with joins in pyspark lead to unpredictable behavior, and I've read that you should disambiguate the names before joining. From Stack Overflow: "Spark Dataframe distinguish columns with duplicated name" and "Pyspark Join and then column select is showing unexpected output". I'm sorry to say I can't find why pyspark doesn't work as you describe.

But the Databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

From Databricks:

If you perform a join in Spark and don't specify your join correctly you'll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don't have duplicated columns.

When you join, you can instead try using an alias (that's typically what I use), or you can pass the join columns as a list or a str:

df = left.join(right, ["priority"]) 
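
With the list form, the join column appears only once in the result. For this question's dataframes, an alias-based version might look like the following. This is a minimal sketch; the alias names a and b are chosen here for illustration:

from pyspark.sql.functions import col

# Alias each side so the duplicated column names can be disambiguated after the join.
a = a_df.alias('a')
b = b_df.alias('b')

# Join on the aliased id columns, then select only the columns you want to keep.
c_df = a.join(b, col('a.id') == col('b.id'), 'left').select('a.id', 'a.priority')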
