Spark join produces wrong results

Problem description

Presenting here before possibly filing a bug. I'm using Spark 1.6.0.

This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.

I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.

scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]

scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]

scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb|  c|  a|  b|  c|
+-----+-----+---+---+---+---+
|    a|    b|  1|  a|  b|  1|
|    a|    b|  1|  a|  b|  2|
+-----+-----+---+---+---+---+

I didn't expect that result at all. I expected the first row, but not the second. I suspected it was the null-safe equality, so I tried it without.

scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb|  c|  a|  b|  c|
+-----+-----+---+---+---+---+
|    a|    b|  1|  a|  b|  1|
+-----+-----+---+---+---+---+

OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question to deal with that warning here: Spark SQL performing carthesian join instead of inner join

So I created a new column that avoids the warning.

scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb|  c|newc|  a|  b|  c|
+-----+-----+---+----+---+---+---+
|    a|    b|  1|   1|  a|  b|  1|
|    a|    b|  1|   1|  a|  b|  2|
+-----+-----+---+----+---+---+---+

But now the result is wrong again! I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.

Is the behaviour a bug, or is this expected behaviour? If expected, why?

Answer

If you want the expected behavior, either join on the column names:

val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")

a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// |  a|  b|  c|
// +---+---+---+
// |  a|  b|  1|
// +---+---+---+
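
The names-based join also accepts an explicit join type, so the left outer join from the question can be written the same way. A minimal sketch, assuming your Spark version has the overload that takes both the column names and a join type:

a.join(b, Seq("a", "b", "c"), "left_outer").show
// expected: the single row (a, b, 1); the join columns appear only once in the output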

or use aliases:

val aa = a.alias("a")
val bb = b.alias("b")

aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")

You can also use <=>:

aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")

As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.

The second behavior looks indeed like a bug related to the fact that you still have a.c in your data. It looks like it is picked downstream before b.c and the evaluated condition is actually a.newc = a.c.

val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")
