Spark 2.3: subtract dataframes but preserve duplicate values (Scala)

Question

Copying the example from the linked question: as a conceptual example, if I have two dataframes:

words     = [the, quick, fox, a, brown, fox]
stopWords = [the, a]

then I want the output to be, in any order:

words - stopWords = [quick, brown, fox, fox]
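The expected result above can be pinned down with plain Scala collections, without Spark. This is only an illustrative sketch: the object name `SubtractSemantics` and its value names are made up for this example. It shows two possible readings of "subtract but keep duplicates". They agree on the words/stopWords example and differ only when the right-hand side contains a value that the left-hand side holds several times.

```scala
object SubtractSemantics {
  val words: Seq[String] = Seq("the", "quick", "fox", "a", "brown", "fox")
  val stopWords: Seq[String] = Seq("the", "a")

  // Reading 1: drop every copy of any value that occurs in stopWords,
  // keep all copies of everything else.
  val dropAllMatches: Seq[String] = words.filterNot(stopWords.toSet)

  // Reading 2: multiset difference. Each element of stopWords cancels
  // exactly one matching element of words (Seq.diff does this).
  val multisetDiff: Seq[String] = words.diff(stopWords)

  // The two readings diverge once the right side repeats a left-side value.
  val foxes: Seq[String] = Seq("fox", "fox")
  val dropAllFoxes: Seq[String] = foxes.filterNot(Set("fox")) // nothing left
  val diffOneFox: Seq[String] = foxes.diff(Seq("fox"))        // one fox left

  def main(args: Array[String]): Unit = {
    println(dropAllMatches)
    println(multisetDiff)
    println(dropAllFoxes)
    println(diffOneFox)
  }
}
```

For the example in the question both readings give quick, fox, brown, fox, so either one satisfies the stated requirement.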

exceptAll can do this in Spark 2.4, but I cannot upgrade. The answer in the linked question only works for one specific dataframe:

words.join(stopwords, words("id") === stopwords("id"), "left_outer")
     .where(stopwords("id").isNull)
     .select(words("id")).show()

That is, you need to know the primary key and the other columns in advance.

Can anyone come up with an answer that will work on any dataframe?

Recommended answer

It turns out to be easier to call df1.except(df2) and then join the result back to df1 to recover all the duplicates.

Full code:

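import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Distinct rows of df1 that have no match in df2.
    val except = df1.except(df2)

    // Null-safe equality (<=>) on every column, so rows containing nulls still match.
    val columns = df1.columns
    val colExpr: Column = df1(columns.head) <=> except(columns.head)
    val joinExpression = columns.tail.foldLeft(colExpr) { (acc, p) =>
        acc && (df1(p) <=> except(p))
    }

    // Joining back to df1 restores every duplicate of the surviving rows.
    val join = df1.join(except, joinExpression, "inner")

    join.select(df1("*"))
}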
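One point worth noting about the function above: it keeps every copy of a row that is absent from df2, and removes all copies of any row that appears in df2 at all. That matches the example in the question, but it is not the multiset difference that exceptAll computes in 2.4. If that stricter behaviour is needed, one possible variant (a sketch, not part of the original answer) is to number the copies of each identical row with row_number and then use plain except. The object name `ExceptAllSketch`, the function names, and the helper column `__occurrence` are hypothetical; the helper column is assumed not to clash with a real column, and both dataframes are assumed to have the same schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object ExceptAllSketch {
  // Hypothetical helper column; assumed not to exist in either input.
  private val Occurrence = "__occurrence"

  // Tag each copy of an identical row with 1, 2, 3, ...
  private def withOccurrence(df: DataFrame): DataFrame = {
    // Backticks keep column names containing dots from being parsed as nested fields.
    val cols = df.columns.map(c => col(s"`$c`"))
    // All rows in a partition are identical, so the ordering column is arbitrary.
    val window = Window.partitionBy(cols: _*).orderBy(cols.head)
    df.withColumn(Occurrence, row_number().over(window))
  }

  // The k-th copy of a row in df1 is cancelled only by a k-th copy in df2,
  // so except (a distinct set difference) leaves the surplus copies.
  def exceptAllByOccurrence(df1: DataFrame, df2: DataFrame): DataFrame =
    withOccurrence(df1).except(withOccurrence(df2)).drop(Occurrence)
}
```

With the words and stopWords dataframes from the question this should return quick, brown, fox, fox in some order, the same as exceptAllCustom. The two differ when df2 contains a row that df1 holds more than once: here only as many copies are removed as df2 contains. The window step shuffles both inputs, so this variant is likely to cost more than the join-based version on large data.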
