加入后如何避免重复的列? [英] How to avoid duplicate columns after join?

查看:130
本文介绍了加入后如何避免重复的列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个带有以下列的数据框:

I have two dataframes with the following columns:

df1.columns
//  Array(ts, id, X1, X2)

df2.columns
//  Array(ts, id, Y1, Y2)

我做完

val df_combined = df1.join(df2, Seq(ts,id))

我最后列出了以下几列:Array(ts, id, X1, X2, ts, id, Y1, Y2).我可以预期公共列会被删除.还有其他需要做的事情吗?

I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I could expect that the common columns would be dropped. Is there something that additional that needs to be done?

推荐答案

简单答案(来自

The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.

下面是一个根据Databricks FAQ改编而成的示例,但有两个连接列,用于回答原始发布者的问题.

Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.

这是数据框:

val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))

val left = llist.toDF("firstname","lastname","date","duration")

left.show()

/*
+---------+--------+----------+--------+
|firstname|lastname|      date|duration|
+---------+--------+----------+--------+
|      bob|       b|2015-01-13|       4|
|    alice|       a|2015-04-23|      10|
+---------+--------+----------+--------+
*/

这是数据框:

val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")

right.show()

/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
|    alice|       a|   100|
|      bob|       b|    23|
+---------+--------+------+
*/

这是一个不正确解决方案,其中连接列被定义为谓词left("firstname")===right("firstname") && left("lastname")===right("lastname").

Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").

错误的结果是在连接的数据框中重复了firstnamelastname列:

The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:

left.join(right, left("firstname")===right("firstname") &&
                 left("lastname")===right("lastname")).show

/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname|      date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
|      bob|       b|2015-01-13|       4|      bob|       b|    23|
|    alice|       a|2015-04-23|      10|    alice|       a|   100|
+---------+--------+----------+--------+---------+--------+------+
*/

正确解决方案是将联接列定义为字符串Seq("firstname", "lastname")的数组.输出数据框没有重复的列:

The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:

left.join(right, Seq("firstname", "lastname")).show

/*
+---------+--------+----------+--------+------+
|firstname|lastname|      date|duration|upload|
+---------+--------+----------+--------+------+
|      bob|       b|2015-01-13|       4|    23|
|    alice|       a|2015-04-23|      10|   100|
+---------+--------+----------+--------+------+
*/

这篇关于加入后如何避免重复的列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆