How to remove duplicate columns with their non-null value after join?

Views: 81 Published: 2020/9/4 20:57:16 scala apache-spark apache-spark-sql

Problem description

I have a dataframe named "A" with 300+ columns, and I am trying to join it with its incremental data "B", which has the same schema as "A".

After joining the dataframes, I get duplicate columns, which I was avoiding by using co

import org.apache.spark.sql.functions.udf

// Return value1 unless it is null, otherwise fall back to value2.
val toPrint = udf((value1: String, value2: String) => if (value1 != null) value1 else value2)
val dfClean = df1
  .join(df2, df1("PERIOD_TAG") === df2("PERIOD_TAG"), "fullouter")
  .select(
    toPrint(df1("PERIOD_SHORT_DESCRIPTION"), df2("PERIOD_SHORT_DESCRIPTION")).alias("PERIOD_SHORT_DESCRIPTION"),
    toPrint(df1("PERIOD_TAG"), df2("PERIOD_TAG")).alias("PERIOD_TAG") // ....so on for all the columns
  )

I am calling a UDF to select the most recently updated value (from the incremental file) among the duplicate columns. The incremental data will contain a few updated rows, which I need to add along with all the new data in the incremental dataframe and also the old data of dataframe "B".

Is there another way that avoids selecting the columns individually, for example by using a for loop? Or is there any way that, after joining, I get the new/updated values of my incremental df and the old values of dataframe "B" which are not present in dataframe "A"?

Solution

I'd first avoid the duplication in join column names by using the single-string usingColumn argument of the join operator.

df1.join(df2, "PERIOD_TAG", "fullouter")

That takes care of de-duplicating the PERIOD_TAG column.

Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
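To see the difference between the two join styles, here is a minimal sketch meant for spark-shell or a Scala script with Spark on the classpath. The dataframes a and b and their sample rows are made up for illustration, and the join uses the Seq form of usingColumns, which is an assumption made to stay compatible with older Spark versions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("using-join").getOrCreate()
import spark.implicits._

// Two hypothetical dataframes sharing one schema.
val a = Seq(("P1", "old")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")
val b = Seq(("P1", "new")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")

// Join on an expression: both PERIOD_TAG columns are kept in the output.
val byExpr = a.join(b, a("PERIOD_TAG") === b("PERIOD_TAG"), "fullouter")

// Join on a column name (USING-style): PERIOD_TAG appears only once.
val byName = a.join(b, Seq("PERIOD_TAG"), "fullouter")

println(byExpr.columns.mkString(", "))
println(byName.columns.mkString(", "))
```

Only the join column is de-duplicated this way; every other column still shows up twice, which is what the coalesce step addresses.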

The last step is to use the coalesce function:

coalesce(e: Column*): Column Returns the first column that is not null, or null if all inputs are null.

That looks like your case exactly and avoids dealing with 300+ columns.

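import org.apache.spark.sql.functions.coalesce
import spark.implicits._ // enables $"..." (assumes a SparkSession named spark)

// Alias both sides so that $"df1.one" and $"df2.one" resolve; "one" stands for any non-key column.
val myCol = coalesce($"df1.one", $"df2.one") as "one"
df1.as("df1").join(df2.as("df2"), Seq("PERIOD_TAG"), "fullouter").
  select(myCol).
  show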

So, the exercise is to build a sequence of myCol-like columns using the coalesce function for every column in the schema (which looks like a fairly easy programming assignment :))
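One possible way to do that exercise is sketched below. The helper name mergePreferIncremental, the aliases "base" and "inc", and the sample rows are all hypothetical. It also assumes that both dataframes share the same schema and that the incremental value should win whenever it is not null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical helper: full outer join on `key`, then keep one column per name,
// preferring the value from `incremental` and falling back to `base`.
def mergePreferIncremental(base: DataFrame, incremental: DataFrame, key: String): DataFrame = {
  // One coalesce expression per non-key column, built from the schema.
  val merged = base.columns.filter(_ != key).map { name =>
    coalesce(col(s"inc.`$name`"), col(s"base.`$name`")).as(name)
  }
  base.as("base")
    .join(incremental.as("inc"), Seq(key), "fullouter")
    .select(col(key) +: merged: _*)
}

val spark = SparkSession.builder().master("local[1]").appName("merge-demo").getOrCreate()
import spark.implicits._

// Made-up sample data: P2 is updated and P3 is new in the incremental set.
val a = Seq(("P1", "old-1"), ("P2", "old-2")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")
val b = Seq(("P2", "new-2"), ("P3", "new-3")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")

val result = mergePreferIncremental(a, b, "PERIOD_TAG")
result.orderBy("PERIOD_TAG").show()
```

Because coalesce skips nulls, a null in the incremental data falls back to the old value, so this approach cannot express "this field was updated to null".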
