Removing duplicate columns after a DF join in Spark

Question

When you join two DFs with similar column names:

df = df1.join(df2, df1['id'] == df2['id'])

The join works fine, but you can't reference the id column afterwards because it is ambiguous; you get the following exception:

pyspark.sql.utils.AnalysisException: "Reference 'id' is ambiguous, could be: id#5691, id#5918.;"

This makes id unusable afterwards...

The following function solves the problem:

def join(df1, df2, cond, how='left'):
    df = df1.join(df2, cond, how=how)
    # Drop df2's copy of every column whose name also appears in df1;
    # referencing the column through df2 disambiguates which copy to drop.
    repeated_columns = [c for c in df1.columns if c in df2.columns]
    for col in repeated_columns:
        df = df.drop(df2[col])
    return df
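
For example (a minimal usage sketch, assuming df1 and df2 share only the id column):

# Hypothetical usage of the helper above.
joined = join(df1, df2, df1['id'] == df2['id'], how='inner')
joined.select('id').show()  # 'id' now resolves unambiguously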

What I don't like about it is that I have to iterate over the column names and delete them one by one. This looks really clunky...

Do you know of any other solution that will either join and remove duplicates more elegantly or delete multiple columns without iterating over each of them?

Answer

If the join columns in both data frames have the same names and you only need an equi-join, you can specify the join columns as a list, in which case the result keeps only one copy of each join column:

df1.show()
+---+----+
| id|val1|
+---+----+
|  1|   2|
|  2|   3|
|  4|   4|
|  5|   5|
+---+----+

df2.show()
+---+----+
| id|val2|
+---+----+
|  1|   2|
|  1|   3|
|  2|   4|
|  3|   5|
+---+----+

df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|   2|   2|
|  1|   2|   3|
|  2|   3|   4|
+---+----+----+
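
The list form also extends to joins on several columns (a hypothetical sketch, assuming the frames additionally shared a date column):

# Hypothetical: if df1 and df2 also had a common 'date' column,
# passing both names as a list keeps a single copy of each join column.
df1.join(df2, ['id', 'date']).show()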


Otherwise you need to give the joined data frames aliases and refer to the duplicated columns by those aliases later:

df1.alias("a").join(
    df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|   2|   2|
|  1|   2|   3|
|  2|   3|   4|
+---+----+----+
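
The same alias approach also lets you keep everything from one side without listing each column (a sketch on the same data, assuming df1 and df2 as above):

# Keep all of df1's columns via "a.*" and add only the extra
# column from df2, instead of enumerating every column by hand.
df1.alias("a").join(
    df2.alias("b"), df1['id'] == df2['id']
).select("a.*", "b.val2").show()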
