不在 Presto 与 Spark SQL 的实现中 [英] NOT IN implementation of Presto v.s Spark SQL
问题描述
我得到了一个非常简单的查询,它显示了在相同硬件上运行 Spark SQL 和 Presto(3 小时 vs 3 分钟)时的显着性能差异.
SELECT 字段从测试 1WHERE 字段不在(从 test2 中选择字段)
经过对查询计划的一些研究,我发现原因是 Spark SQL 如何处理 NOT IN
谓词子查询.为了正确处理 NOT IN 的 NULL,Spark SQL 将 NOT IN
谓词翻译为 Left AntiJoin( (test1=test2) OR isNULL(test1=test2))
.>
Spark SQL 引入了 OR isNULL(test1=test2)
来确保 NOT IN
的正确语义.
然而,Left AntiJoin 连接谓词的OR
导致Left AntiJoin
唯一可行的物理连接策略是BroadcastNestedLoopJoin
.对于当前阶段,我可以将 NOT IN 重写为 NOT EXISTS 以解决此问题.在 NOT EXISTS 的查询计划中,我可以看到连接谓词是 Left AntiJoin(test1=test2)
这导致更好的 NOT EXISTS 物理连接运算符(完成 5 分钟).
到目前为止,我很幸运,因为我的数据集目前没有任何 NULL
属性,但将来可能会有,而 NOT IN 的语义正是我真正想要的.
所以我检查了 Presto 的查询计划,它并没有真正提供 Left AntiJoin
但它使用 SemiJoin
和 FilterPredicate = not (expr)
.Presto 的查询计划并没有像 Spark 那样提供太多的信息.
所以我的问题更像是:
我可以假设 Presto 有更好的物理连接运算符来处理 NOT IN
操作吗?不像 Spark SQL,它不依赖于连接谓词的重写 isnull(op1 = op2)
来确保逻辑计划级别中 NOT IN 的正确语义.
实际上是我在 Presto 中实现了半连接(IN
谓词)的 NULL
处理的人.
除了散列分区¹之外,Presto 还使用复制空值和任何行"复制模式,这允许它在存在 NULL
的情况下正确处理 IN
IN
的一侧,不回退到广播,或使执行单线程或单节点.运行时性能成本实际上与 NULL
值根本不存在一样.
如果您想了解有关 Presto 内部结构的更多信息,请加入 Presto 上的 #dev
频道社区松弛.
¹) 准确地说,半连接是散列分区或广播,具体取决于基于成本的决策或配置.
I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.
SELECT field
FROM test1
WHERE field NOT IN (SELECT field FROM test2)
After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN
predicate subquery.
To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN
predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2))
.
Spark SQL introduces OR isNULL(test1=test2)
to ensure the correct semantics of NOT IN
.
However, the OR
of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin
is BroadcastNestedLoopJoin
. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2)
which causes a better physical join operator for NOT EXISTS (5 mins to finish).
So far I am lucky since my dataset currently does not have any NULL
attributes, but it may have in the future and the semantics of NOT IN is what I really want.
So I check query plan of Presto, It does not really provides Left AntiJoin
but it uses SemiJoin
with a FilterPredicate = not (expr)
. The query plan of Presto does not provide too much info like Spark.
So my question is more like:
Could I assume Presto has a better physical join operator to handle NOT IN
operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2)
to ensure the correct semantics of NOT IN in the logical plan level.
I am actually the person who implemented NULL
treatment for semi join (IN
predicate) in Presto.
Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN
correctly in the presence of NULL
s on either side of the IN
, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL
values didn't exist at all.
If you want to learn more about Presto internals, join the #dev
channel on Presto Community Slack.
¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.
这篇关于不在 Presto 与 Spark SQL 的实现中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!