Spark SQL performing Cartesian join instead of inner join


Problem description

I am trying to join two DataFrames with each other after performing some earlier computation. The command is simple:

    employee.join(employer, employee("id") === employer("id"))

However, the join seems to perform a Cartesian join, completely ignoring my === condition. Does anyone have an idea why this is happening?

Recommended answer

I think I have fought with the same issue. Check whether you get this warning:

    Constructing trivially true equals predicate [..]

after creating the join operation. If so, just alias one of the columns in either the employee or the employer DataFrame, e.g. like this:

    employee.select(<columns you want>, employee("id").as("id_e"))

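Then perform the join on employee("id_e") === employer("id"), where employee is now the DataFrame returned by the select above.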
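The workaround can be sketched end to end as follows. This is a minimal sketch, not the asker's actual code: it assumes Spark 2.x or later (SparkSession), and the object name AliasJoinSketch, the helper aliasedJoin, the name and company columns, and the sample rows are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AliasJoinSketch {
  // Aliases the join key on the employee side so that it is no longer the
  // very same column as employer("id"), then performs the inner join.
  def aliasedJoin(employee: DataFrame, employer: DataFrame): DataFrame = {
    val employeeAliased =
      employee.select(employee("name"), employee("id").as("id_e"))
    employeeAliased.join(employer, employeeAliased("id_e") === employer("id"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val employee = Seq((1, "Ann", "Acme"), (2, "Bob", "Initech"))
      .toDF("id", "name", "company")
    // employer is computed from employee, so its id column is the same
    // underlying column as employee("id").
    val employer = employee.select(employee("id"), employee("company"))

    aliasedJoin(employee, employer).show()
    spark.stop()
  }
}
```

The join condition refers to employeeAliased("id_e") rather than employee("id"), so the two sides of === are no longer the same expression.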

Explanation. Consider this flow of operations:

If you directly use DataFrame A to compute DataFrame B and then join them on the column id, which comes from DataFrame A, you will not be performing the join you want. The id column of DataFrame B is in fact exactly the same column as the one in DataFrame A, so Spark will just assert that the column is equal to itself, hence the trivially true predicate. To avoid this, you have to alias one of the columns so that they appear as "different" columns to Spark. For now only a warning message has been implemented, in this way:

    def === (other: Any): Column = {
      val right = lit(other).expr
      if (this.expr == right) {
        logWarning(
          s"Constructing trivially true equals predicate, '${this.expr} = $right'. " +
            "Perhaps you need to use aliases.")
      }
      EqualTo(expr, right)
    }

This is not a very good solution for me (it is really easy to miss the warning message); I hope it will be fixed somehow.

You are lucky to see the warning message at all, though: it was added not so long ago (http://mail-archives.apache.org/mod_mbox/spark-commits/201503.mbox/%3C9f97c70720564d39afce3a5d93027f70@git.apache.org%3E) ;).
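As an alternative to renaming a column, a commonly used pattern is to alias the DataFrames themselves and refer to the key through the alias. This is not part of the answer above; it is a sketch under the assumption of Spark 2.x or later, and the object name DatasetAliasSketch, the helper joinWithAliases, the aliases "a" and "b", and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DatasetAliasSketch {
  // Gives each side its own alias and resolves the key columns by qualified
  // name, so the condition compares the left id with the right id.
  def joinWithAliases(a: DataFrame, b: DataFrame): DataFrame =
    a.as("a").join(b.as("b"), col("a.id") === col("b.id"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-alias-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; employer is derived from employee,
    // as in the explanation above.
    val employee = Seq((1, "Ann", "Acme"), (2, "Bob", "Initech"))
      .toDF("id", "name", "company")
    val employer = employee.select(employee("id"), employee("company"))

    joinWithAliases(employee, employer).show()
    spark.stop()
  }
}
```

Unlike the column rename, this keeps the original column names on both sides, so the result contains two id columns that have to be selected through their aliases.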
