Spark 2.3: subtract dataframes but preserve duplicate values (Scala)


Problem description


Copying example from this question: As a conceptual example, if I have two dataframes:

words     = [the, quick, fox, a, brown, fox]
stopWords = [the, a]


then I want the output to be, in any order:

words - stopWords = [quick, brown, fox, fox]


exceptAll can do this in Spark 2.4, but I cannot upgrade. The answer in the linked question is specific to one dataframe's schema:

words.join(stopwords, words("id") === stopwords("id"), "left_outer")
     .where(stopwords("id").isNull)
     .select(words("id")).show()

That is, you need to know the primary key and the other columns.


Can anyone come up with an answer that will work on any dataframe?

Recommended answer


It turns out it's easier to do df1.except(df2) and then join the result back against df1 to recover all the duplicates.

Full code:

import org.apache.spark.sql.{Column, DataFrame}

def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Distinct rows of df1 that do not appear in df2 (duplicates collapsed).
    val except = df1.except(df2)

    // Build a null-safe equality predicate (<=>) across every column.
    val columns = df1.columns
    val colExpr: Column = df1(columns.head) <=> except(columns.head)
    val joinExpression = columns.tail.foldLeft(colExpr) { (acc, p) =>
        acc && df1(p) <=> except(p)
    }

    // Inner-joining df1 against the distinct survivors restores the duplicates.
    val join = df1.join(except, joinExpression, "inner")

    join.select(df1("*"))
}
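As a sketch, here is how exceptAllCustom could be applied to the words/stopWords example from the question (the single-column dataframes, the `word` column name, and the local SparkSession setup are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("exceptAllCustomDemo")
  .getOrCreate()
import spark.implicits._

val words     = Seq("the", "quick", "fox", "a", "brown", "fox").toDF("word")
val stopWords = Seq("the", "a").toDF("word")

// "fox" appears twice in words and survives twice in the result,
// which plain except() would have collapsed to a single row.
val result = exceptAllCustom(words, stopWords)
result.show()
```

Note that, unlike Spark 2.4's exceptAll, this version keeps every copy of a row from df1 as long as the row is absent from df2; it does not subtract per-row counts when the same row occurs in both dataframes.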

