How to remove duplicate columns with their non-null value after join?
Problem description
I have a dataframe named "A" with 300+ columns, and I am trying to join it with its incremental data "B", which has the same schema as "A".
After joining the dataframes, I get duplicate columns. I have been avoiding them with the following:
val toPrint = udf((value1: String, value2: String) => if (value1 != null) value1 else value2)
val dfClean = df1.join(df2, df1("PERIOD_TAG") === df2("PERIOD_TAG"), "fullouter")
  .select(
    toPrint(df1("PERIOD_SHORT_DESCRIPTION"), df2("PERIOD_SHORT_DESCRIPTION")).alias("PERIOD_SHORT_DESCRIPTION"),
    toPrint(df1("PERIOD_TAG"), df2("PERIOD_TAG")).alias("PERIOD_TAG")
    // ... and so on for all the columns
  )
I am calling a UDF to select the most up-to-date value (from the incremental file) among the duplicate columns. The incremental data will contain a few updated records, which I need to include along with all the new data in the incremental dataframe and also the old data of dataframe "B".
Is there another way that avoids selecting the columns individually, for example by using a for loop? Or is there a way, after joining, to get the new/updated values from my incremental df together with the old values of dataframe "B" that are not present in dataframe "A"?
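I'd first avoid duplicating the join column by passing column names (the usingColumn / usingColumns argument of the join operator) instead of a join expression. The Seq overload shown here accepts a join type and has been available since Spark 2.0; the single-string overload combined with a join type may not exist in older releases.
df1.join(df2, Seq("PERIOD_TAG"), "fullouter")
That takes care of de-duplicating the PERIOD_TAG column. As the API documentation puts it:
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.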
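To make the difference concrete, here is a minimal sketch that compares the output columns of the two join styles. The sample rows, the local SparkSession settings, and the names `left`, `right`, `exprJoin` and `usingJoin` are assumptions made up for illustration; only the column names come from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("using-join-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the two columns named in the question.
val left  = Seq(("P1", "old one")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")
val right = Seq(("P1", "new one")).toDF("PERIOD_TAG", "PERIOD_SHORT_DESCRIPTION")

// Join on an expression: both copies of every column survive, including the key.
val exprJoin = left.join(right, left("PERIOD_TAG") === right("PERIOD_TAG"), "fullouter")

// Join on a column name: the key appears once, the other columns are still duplicated.
val usingJoin = left.join(right, Seq("PERIOD_TAG"), "fullouter")

println(exprJoin.columns.mkString(", "))
println(usingJoin.columns.mkString(", "))
```

The name-based join only removes the duplicate key; the remaining columns still appear twice and need the next step.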
The last step is to use the coalesce function:
coalesce(e: Column*): Column
Returns the first column that is not null, or null if all inputs are null.
That looks like your case exactly and avoids dealing with 300+ columns.
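// Requires: import org.apache.spark.sql.functions.coalesce and import spark.implicits._
// "one" stands for any non-key column; the aliases make $"df1.one" / $"df2.one" resolvable.
val myCol = coalesce($"df1.one", $"df2.one") as "one"
df1.as("df1").join(df2.as("df2"), Seq("PERIOD_TAG"), "fullouter").
  select(myCol).
  show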
So, the exercise is to build a sequence of myCol-like columns using the coalesce function for every column in the schema (which looks like a fairly easy programming assignment :))