Spark join produces wrong results
Problem description
Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.
I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.
scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]
scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]
scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
| a| b| 1| a| b| 2|
+-----+-----+---+---+---+---+
I didn't expect that result at all. I expected the first row, but not the second. I suspected it's the null-safe equality, so I tried it without.
scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
+-----+-----+---+---+---+---+
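OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question that deals with the warning: "Spark SQL performing carthesian join instead of inner join" (http://stackoverflow.com/questions/32190828/spark-sql-performing-carthesian-join-instead-of-inner-join).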
So I create a new column that avoids the warning.
scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb| c|newc| a| b| c|
+-----+-----+---+----+---+---+---+
| a| b| 1| 1| a| b| 1|
| a| b| 1| 1| a| b| 2|
+-----+-----+---+----+---+---+---+
But now the result is wrong again! I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.
Is the behaviour a bug, or is this expected behaviour? If expected, why?
Recommended answer
If you want the expected behavior, use either a join on names:
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// | a| b| c|
// +---+---+---+
// | a| b| 1|
// +---+---+---+
or aliases:
val aa = a.alias("a")
val bb = b.alias("b")
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
You can use <=> as well:
aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
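For readers less familiar with the two operators, the difference between === and <=> can be sketched in plain Scala without Spark. This is only a toy model: None stands in for SQL NULL, and the helper names sqlEq and nullSafeEq are made up for illustration and are not part of Spark's API.

```scala
// Toy model of SQL comparison semantics; None stands in for NULL.
// sqlEq and nullSafeEq are hypothetical helpers, not Spark functions.
object NullSafeEqualitySketch {
  // Plain equality (=== in the DataFrame DSL): a NULL on either side yields NULL.
  def sqlEq[A](l: Option[A], r: Option[A]): Option[Boolean] =
    for (x <- l; y <- r) yield x == y

  // Null-safe equality (<=>): always true or false; two NULLs compare as equal.
  def nullSafeEq[A](l: Option[A], r: Option[A]): Boolean = l == r

  def main(args: Array[String]): Unit = {
    val one: Option[Int] = Some(1)
    val missing: Option[Int] = None
    println(sqlEq(one, one))              // Some(true)
    println(sqlEq(one, missing))          // None
    println(sqlEq(missing, missing))      // None
    println(nullSafeEq(missing, missing)) // true
    println(nullSafeEq(one, missing))     // false
  }
}
```

In a join condition a NULL result is treated as no match, which is why rows with NULL keys only pair up under <=>.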
As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.
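The warning text ('c#18232 = c#18232') suggests that a("c") and b("c") resolve to the very same attribute, since a was derived from b by a filter. A rough way to picture this, as a toy model in plain Scala rather than Spark's actual internals, is to give every column a name plus a unique id. The names Attr, Frame, where and renamed are hypothetical, and the rules that filtering keeps ids while renaming assigns a fresh one are assumptions of the model.

```scala
// Toy model only: Attr, Frame and their methods are hypothetical, not Spark classes.
object AttributeIdSketch {
  // A resolved column: a name plus a unique id, mirroring the "c#18232" notation.
  final case class Attr(name: String, id: Long)

  private var nextId = 0L
  private def freshId(): Long = { nextId += 1; nextId }

  // A frame is reduced to its list of output attributes.
  final case class Frame(attrs: List[Attr]) {
    def apply(name: String): Attr = attrs.find(_.name == name).get
    // Assumption: filtering rows keeps the same attributes, ids included.
    def where(): Frame = this
    // Assumption: renaming a column produces an attribute with a fresh id.
    def renamed(from: String, to: String): Frame =
      Frame(attrs.map(a => if (a.name == from) Attr(to, freshId()) else a))
  }

  def frame(names: String*): Frame =
    Frame(names.toList.map(n => Attr(n, freshId())))

  def main(args: Array[String]): Unit = {
    val b = frame("a", "b", "c")
    val a = b.where().renamed("a", "filta").renamed("b", "filtb")
    println(a("c") == b("c"))     // true: same attribute on both sides
    println(a("filta") == b("a")) // false: the rename created a new attribute
  }
}
```

Under this model a("c") === b("c") compares a column with itself, which is the situation the warning describes. Aliasing both sides, as in the joins above, gives each side a qualifier that can be told apart.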
The second behavior does indeed look like a bug, related to the fact that you still have a.c in your data. It looks like a.c is picked downstream before b.c, so the condition actually evaluated is a.newc = a.c.
val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")