不在 Presto 与 Spark SQL 的实现中 [英] NOT IN implementation of Presto v.s Spark SQL

查看:23
本文介绍了不在 Presto 与 Spark SQL 的实现中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我得到了一个非常简单的查询,它显示了在相同硬件上运行 Spark SQL 和 Presto(3 小时 vs 3 分钟)时的显着性能差异.

SELECT 字段从测试 1WHERE 字段不在(从 test2 中选择字段)

经过对查询计划的一些研究,我发现原因是 Spark SQL 如何处理 NOT IN 谓词子查询.为了正确处理 NOT IN 的 NULL,Spark SQL 将 NOT IN 谓词翻译为 Left AntiJoin( (test1=test2) OR isNULL(test1=test2)).>

Spark SQL 引入了 OR isNULL(test1=test2) 来确保 NOT IN 的正确语义.

然而,Left AntiJoin 连接谓词的OR 导致Left AntiJoin 唯一可行的物理连接策略是BroadcastNestedLoopJoin.对于当前阶段,我可以将 NOT IN 重写为 NOT EXISTS 以解决此问题.在 NOT EXISTS 的查询计划中,我可以看到连接谓词是 Left AntiJoin(test1=test2) 这导致更好的 NOT EXISTS 物理连接运算符(完成 5 分钟).

到目前为止,我很幸运,因为我的数据集目前没有任何 NULL 属性,但将来可能会有,而 NOT IN 的语义正是我真正想要的.

所以我检查了 Presto 的查询计划,它并没有真正提供 Left AntiJoin 但它使用 SemiJoinFilterPredicate = not (expr).Presto 的查询计划并没有像 Spark 那样提供太多的信息.

所以我的问题更像是:

我可以假设 Presto 有更好的物理连接运算符来处理 NOT IN 操作吗?不像 Spark SQL,它不依赖于连接谓词的重写 isnull(op1 = op2) 来确保逻辑计划级别中 NOT IN 的正确语义.

解决方案

实际上是我在 Presto 中实现了半连接(IN 谓词)的 NULL 处理的人.

除了散列分区¹之外,Presto 还使用复制空值和任何行"复制模式,这允许它在存在 NULL 的情况下正确处理 ININ 的一侧,不回退到广播,或使执行单线程或单节点.运行时性能成本实际上与 NULL 值根本不存在一样.

如果您想了解有关 Presto 内部结构的更多信息,请加入 Presto 上的 #dev 频道社区松弛.

¹) 准确地说,半连接是散列分区或广播,具体取决于基于成本的决策或配置.

I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.

SELECT field 
FROM test1 
WHERE field NOT IN (SELECT field FROM test2)

After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN predicate subquery. To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2)).

Spark SQL introduces OR isNULL(test1=test2) to ensure the correct semantics of NOT IN.

However, the OR of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin is BroadcastNestedLoopJoin. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2) which causes a better physical join operator for NOT EXISTS (5 mins to finish).

So far I am lucky since my dataset currently does not have any NULL attributes, but it may have in the future and the semantics of NOT IN is what I really want.

So I check query plan of Presto, It does not really provides Left AntiJoin but it uses SemiJoin with a FilterPredicate = not (expr). The query plan of Presto does not provide too much info like Spark.

So my question is more like:

Could I assume Presto has a better physical join operator to handle NOT IN operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2) to ensure the correct semantics of NOT IN in the logical plan level.

解决方案

I am actually the person who implemented NULL treatment for semi join (IN predicate) in Presto.

Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN correctly in the presence of NULLs on either side of the IN, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL values didn't exist at all.

If you want to learn more about Presto internals, join the #dev channel on Presto Community Slack.

¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.

这篇关于不在 Presto 与 Spark SQL 的实现中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆