NOT IN implementation of Presto vs. Spark SQL

Published: 2020/9/4 19:42:17. Tags: null, apache-spark-sql, presto, isnull


Problem description

I have a very simple query that shows a significant performance difference between Spark SQL and Presto (3 hours vs. 3 minutes) on the same hardware.

SELECT field 
FROM test1 
WHERE field NOT IN (SELECT field FROM test2)

After studying the query plans, I found that the cause is how Spark SQL handles a NOT IN predicate subquery. To handle NULLs correctly, Spark SQL translates the NOT IN predicate into Left AntiJoin((test1=test2) OR isNULL(test1=test2)).

Spark SQL adds OR isNULL(test1=test2) to preserve the correct semantics of NOT IN.
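The NULL handling described above can be checked on a small example. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in for standard SQL three-valued logic; it is not Spark SQL or Presto. The helper name run and the sample values are made up for illustration, and the NOT EXISTS form only imitates the anti-join predicate quoted above.

```python
import sqlite3

NOT_IN = "SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2)"

# Anti join written with the same predicate shape as the Spark plan:
# (test1 = test2) OR isnull(test1 = test2)
NULL_AWARE_ANTI_JOIN = """
    SELECT field FROM test1
    WHERE NOT EXISTS (
        SELECT 1 FROM test2
        WHERE test1.field = test2.field
           OR (test1.field = test2.field) IS NULL
    )
"""


def run(test1_values, test2_values):
    """Load both tables into an in-memory SQLite database and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test1 (field INTEGER)")
    conn.execute("CREATE TABLE test2 (field INTEGER)")
    conn.executemany("INSERT INTO test1 VALUES (?)", [(v,) for v in test1_values])
    conn.executemany("INSERT INTO test2 VALUES (?)", [(v,) for v in test2_values])
    not_in_rows = conn.execute(NOT_IN).fetchall()
    anti_join_rows = conn.execute(NULL_AWARE_ANTI_JOIN).fetchall()
    conn.close()
    return not_in_rows, anti_join_rows


if __name__ == "__main__":
    # A NULL in test2 makes every NOT IN comparison either false or unknown,
    # so no row of test1 survives in either formulation.
    print(run([1, 2, None], [2, None]))
    # Without a NULL in test2, only the NULL row of test1 is dropped in addition
    # to the matching row.
    print(run([1, 2, None], [2]))
```

In both cases the two queries return the same rows, which is why the extra isnull term is needed: a plain equality predicate would treat an unknown comparison as "no match" and keep the row.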

However, the OR in the join predicate means that the only feasible physical join strategy for the Left AntiJoin is BroadcastNestedLoopJoin. For now, I can work around the issue by rewriting NOT IN as NOT EXISTS. In the query plan for NOT EXISTS, the join predicate is Left AntiJoin(test1=test2), which allows a better physical join operator (the query finishes in 5 minutes).

So far I have been lucky, since my dataset currently has no NULL values, but it may in the future, and the semantics of NOT IN are what I really want.
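To illustrate why the NOT EXISTS workaround is only safe while the data has no NULLs, here is a minimal sketch that runs both forms side by side. It again uses SQLite via Python's sqlite3 module as an assumed stand-in for standard SQL semantics, not Spark SQL or Presto; the helper name compare and the sample values are hypothetical.

```python
import sqlite3

NOT_IN = "SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2)"
NOT_EXISTS = """
    SELECT field FROM test1
    WHERE NOT EXISTS (SELECT 1 FROM test2 WHERE test1.field = test2.field)
"""


def compare(test1_values, test2_values):
    """Return (NOT IN rows, NOT EXISTS rows) for the given table contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test1 (field INTEGER)")
    conn.execute("CREATE TABLE test2 (field INTEGER)")
    conn.executemany("INSERT INTO test1 VALUES (?)", [(v,) for v in test1_values])
    conn.executemany("INSERT INTO test2 VALUES (?)", [(v,) for v in test2_values])
    result = (conn.execute(NOT_IN).fetchall(), conn.execute(NOT_EXISTS).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    # No NULLs anywhere: both forms agree.
    print(compare([1, 2, 3], [2]))
    # NULL in test2: NOT IN returns nothing, NOT EXISTS still returns 1 and 3.
    print(compare([1, 2, 3], [2, None]))
    # NULL in test1: NOT IN drops the NULL row, NOT EXISTS keeps it.
    print(compare([1, None], [2]))
```

The two forms are interchangeable only in the first case, which matches the situation described above where the dataset currently contains no NULLs.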

So I checked Presto's query plan. It does not actually use a Left AntiJoin; instead it uses a SemiJoin with FilterPredicate = not (expr). Presto's query plan does not show as much detail as Spark's.

So my question is more like:

Can I assume that Presto has a better physical join operator for handling NOT IN? Unlike Spark SQL, it does not rely on rewriting the join predicate with isnull(op1 = op2) to ensure the correct semantics of NOT IN at the logical plan level.
