Spark SQL performing Cartesian join instead of inner join

Problem Description

I am trying to join two DataFrames with each other after performing some earlier computation. The command is simple:

    employee.join(employer, employee("id") === employer("id"))

However, the join seems to perform a Cartesian join, completely ignoring my === statement. Does anyone have an idea why this is happening?

Recommended Answer

I think I fought with the same issue. Check if you get the following warning:

    Constructing trivially true equals predicate [..]

after creating the join operation. If so, just alias one of the columns in either the employee or the employer DataFrame, e.g. like this:

    employee.select(<columns you want>, employee("id").as("id_e"))

Then perform the join on employee("id_e") === employer("id").
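
For example, a minimal sketch of this workaround (the name column below is just a hypothetical stand-in for whatever other columns you want to keep):

    // Alias employee's "id" so the two sides of the join condition are
    // distinct columns from Spark's point of view.
    val employeeAliased = employee.select(employee("name"), employee("id").as("id_e"))

    // The equi-join is now performed as intended instead of producing a
    // trivially true predicate.
    val joined = employeeAliased.join(employer, employeeAliased("id_e") === employer("id"))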

Explanation. Consider this flow of operations:

If you directly use your DataFrame A to compute DataFrame B and then join them on the id column, which comes from DataFrame A, you will not be performing the join you want. The id column in DataFrame B is in fact exactly the same column as in DataFrame A, so Spark will just assert that the column is equal to itself, hence the trivially true predicate. To avoid this, you have to alias one of the columns so that they appear as "different" columns to Spark. For now, only a warning message has been implemented for this case:

    def === (other: Any): Column = {
      val right = lit(other).expr
      if (this.expr == right) {
        logWarning(
          s"Constructing trivially true equals predicate, '${this.expr} = $right'. " +
            "Perhaps you need to use aliases.")
      }
      EqualTo(expr, right)
    }
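
For illustration, here is a minimal hypothetical reproduction of the scenario that triggers this warning (assuming a SparkSession named spark is in scope; on older 1.x versions use an SQLContext instead):

    // dfB is computed directly from dfA, so dfB("id") resolves to the very
    // same column expression as dfA("id").
    val dfA = spark.range(5).toDF("id")
    val dfB = dfA.withColumn("doubled", dfA("id") * 2)

    // Both sides of === are the same expression: Spark logs "Constructing
    // trivially true equals predicate" and the condition is always true,
    // which is why the join degenerates into a Cartesian product.
    val joined = dfA.join(dfB, dfA("id") === dfB("id"))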

It is not a very good solution for me (it is really easy to miss the warning message); I hope this will be fixed somehow.

You are lucky to see the warning message at all, though; it was added not so long ago ;).
