加入后如何避免重复列? [英] How to avoid duplicate columns after join?

查看:22
本文介绍了加入后如何避免重复列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个包含以下列的数据框:

I have two dataframes with the following columns:

df1.columns
//  Array(ts, id, X1, X2)

df2.columns
//  Array(ts, id, Y1, Y2)

在我做完之后

val df_combined = df1.join(df2, Seq(ts,id))

我最终得到以下列:Array(ts, id, X1, X2, ts, id, Y1, Y2).我可以预期公共列会被删除.有什么额外的事情需要做吗?

I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I could expect that the common columns would be dropped. Is there something that additional that needs to be done?

推荐答案

简单的答案(来自 Databricks 对此问题的常见问题解答) 是执行连接,其中连接的列表示为字符串数组(或一个字符串)而不是谓词.

The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.

以下是改编自 Databricks 常见问题解答的示例,但有两个连接列以回答原始发布者的问题.

Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.

这是左侧数据框:

val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))

val left = llist.toDF("firstname","lastname","date","duration")

left.show()

/*
+---------+--------+----------+--------+
|firstname|lastname|      date|duration|
+---------+--------+----------+--------+
|      bob|       b|2015-01-13|       4|
|    alice|       a|2015-04-23|      10|
+---------+--------+----------+--------+
*/

这是正确的数据框:

val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")

right.show()

/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
|    alice|       a|   100|
|      bob|       b|    23|
+---------+--------+------+
*/

这是一个不正确的解决方案,其中连接列被定义为谓词left("firstname")===right("firstname") &&left("lastname")====right("lastname").

Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").

错误的结果是 firstnamelastname 列在连接的数据框中重复:

The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:

left.join(right, left("firstname")===right("firstname") &&
                 left("lastname")===right("lastname")).show

/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname|      date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
|      bob|       b|2015-01-13|       4|      bob|       b|    23|
|    alice|       a|2015-04-23|      10|    alice|       a|   100|
+---------+--------+----------+--------+---------+--------+------+
*/

正确的解决方案是将连接列定义为字符串数组Seq("firstname", "lastname").输出数据框没有重复的列:

The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:

left.join(right, Seq("firstname", "lastname")).show

/*
+---------+--------+----------+--------+------+
|firstname|lastname|      date|duration|upload|
+---------+--------+----------+--------+------+
|      bob|       b|2015-01-13|       4|    23|
|    alice|       a|2015-04-23|      10|   100|
+---------+--------+----------+--------+------+
*/

这篇关于加入后如何避免重复列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆