不在Presto v.s Spark SQL的实现中 [英] NOT IN implementation of Presto v.s Spark SQL
问题描述
我得到了一个非常简单的查询,该查询在同一硬件上运行Spark SQL和Presto时(3小时v.s 3分钟)显示出显着的性能差异.
I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.
SELECT field
FROM test1
WHERE field NOT IN (SELECT field FROM test2)
对查询计划进行了一些研究之后,我发现原因是Spark SQL如何处理NOT IN
谓词子查询.
为了正确处理NOT IN的NULL,Spark SQL将NOT IN
谓词转换为Left AntiJoin( (test1=test2) OR isNULL(test1=test2))
.
After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN
predicate subquery.
To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN
predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2))
.
Spark SQL引入了OR isNULL(test1=test2)
以确保NOT IN
的正确语义.
Spark SQL introduces OR isNULL(test1=test2)
to ensure the correct semantics of NOT IN
.
但是,Left AntiJoin连接谓词的OR
导致Left AntiJoin
唯一可行的物理连接策略是BroadcastNestedLoopJoin
.在当前阶段,我可以将NOT IN改写为NOT EXISTS来解决此问题.在NOT EXISTS的查询计划中,我可以看到join谓词为Left AntiJoin(test1=test2)
,这会为NOT EXISTS(完成5分钟)带来更好的物理联接运算符.
However, the OR
of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin
is BroadcastNestedLoopJoin
. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2)
which causes a better physical join operator for NOT EXISTS (5 mins to finish).
到目前为止,我很幸运,因为我的数据集当前不具有任何NULL
属性,但是将来可能会具有,而NOT IN的语义正是我真正想要的.
So far I am lucky since my dataset currently does not have any NULL
attributes, but it may have in the future and the semantics of NOT IN is what I really want.
所以我检查了Presto的查询计划,它并没有真正提供Left AntiJoin
,但是它使用了SemiJoin
和FilterPredicate = not (expr)
. Presto的查询计划没有提供太多像Spark这样的信息.
So I check query plan of Presto, It does not really provides Left AntiJoin
but it uses SemiJoin
with a FilterPredicate = not (expr)
. The query plan of Presto does not provide too much info like Spark.
所以我的问题更像是:
我是否可以认为Presto具有更好的物理联接运算符来处理NOT IN
操作?与Spark SQL不同,它不依赖于连接谓词isnull(op1 = op2)
的重写来确保逻辑计划级别的NOT IN正确语义.
Could I assume Presto has a better physical join operator to handle NOT IN
operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2)
to ensure the correct semantics of NOT IN in the logical plan level.
推荐答案
我实际上是在Presto中对半联接(IN
谓词)实施NULL
处理的人.
I am actually the person who implemented NULL
treatment for semi join (IN
predicate) in Presto.
Presto除了使用散列分区¹之外,还使用复制空值和任何行"复制模式,这使它可以在IN
的任一侧都存在NULL
的情况下正确处理IN
,而不会回退广播,或使执行单线程或单节点.运行时性能成本实际上与NULL
值根本不存在一样.
Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN
correctly in the presence of NULL
s on either side of the IN
, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL
values didn't exist at all.
如果您想了解有关Presto内部的更多信息,请在 Presto社区松弛上加入#dev
频道.
If you want to learn more about Presto internals, join the #dev
channel on Presto Community Slack.
¹)确切地说,半连接是散列分区或广播,具体取决于基于成本的决策或配置.
¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.
这篇关于不在Presto v.s Spark SQL的实现中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!