不在Presto v.s Spark SQL的实现中 [英] NOT IN implementation of Presto v.s Spark SQL

查看:101
本文介绍了不在Presto v.s Spark SQL的实现中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我得到了一个非常简单的查询,该查询在同一硬件上运行Spark SQL和Presto时(3小时v.s 3分钟)显示出显着的性能差异.

I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.

SELECT field 
FROM test1 
WHERE field NOT IN (SELECT field FROM test2)

对查询计划进行了一些研究之后,我发现原因是Spark SQL如何处理NOT IN谓词子查询. 为了正确处理NOT IN的NULL,Spark SQL将NOT IN谓词转换为Left AntiJoin( (test1=test2) OR isNULL(test1=test2)).

After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN predicate subquery. To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2)).

Spark SQL引入了OR isNULL(test1=test2)以确保NOT IN的正确语义.

Spark SQL introduces OR isNULL(test1=test2) to ensure the correct semantics of NOT IN.

但是,Left AntiJoin连接谓词的OR导致Left AntiJoin唯一可行的物理连接策略是BroadcastNestedLoopJoin.在当前阶段,我可以将NOT IN改写为NOT EXISTS来解决此问题.在NOT EXISTS的查询计划中,我可以看到join谓词为Left AntiJoin(test1=test2),这会为NOT EXISTS(完成5分钟)带来更好的物理联接运算符.

However, the OR of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin is BroadcastNestedLoopJoin. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2) which causes a better physical join operator for NOT EXISTS (5 mins to finish).

到目前为止,我很幸运,因为我的数据集当前不具有任何NULL属性,但是将来可能会具有,而NOT IN的语义正是我真正想要的.

So far I am lucky since my dataset currently does not have any NULL attributes, but it may have in the future and the semantics of NOT IN is what I really want.

所以我检查了Presto的查询计划,它并没有真正提供Left AntiJoin,但是它使用了SemiJoinFilterPredicate = not (expr). Presto的查询计划没有提供太多像Spark这样的信息.

So I check query plan of Presto, It does not really provides Left AntiJoin but it uses SemiJoin with a FilterPredicate = not (expr). The query plan of Presto does not provide too much info like Spark.

所以我的问题更像是:

我是否可以认为Presto具有更好的物理联接运算符来处理NOT IN操作?与Spark SQL不同,它不依赖于连接谓词isnull(op1 = op2)的重写来确保逻辑计划级别的NOT IN正确语义.

Could I assume Presto has a better physical join operator to handle NOT IN operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2) to ensure the correct semantics of NOT IN in the logical plan level.

推荐答案

我实际上是在Presto中对半联接(IN谓词)实施NULL处理的人.

I am actually the person who implemented NULL treatment for semi join (IN predicate) in Presto.

Presto除了使用散列分区¹之外,还使用复制空值和任何行"复制模式,这使它可以在IN的任一侧都存在NULL的情况下正确处理IN,而不会回退广播,或使执行单线程或单节点.运行时性能成本实际上与NULL值根本不存在一样.

Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN correctly in the presence of NULLs on either side of the IN, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL values didn't exist at all.

如果您想了解有关Presto内部的更多信息,请在 Presto社区松弛上加入#dev频道.

If you want to learn more about Presto internals, join the #dev channel on Presto Community Slack.

¹)确切地说,半连接是散列分区或广播,具体取决于基于成本的决策或配置.

¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.

这篇关于不在Presto v.s Spark SQL的实现中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆