Spark replacement for EXISTS and IN

Views: 130 | Published: 2020/9/4 18:48:34 | Tags: sql, apache-spark-sql

Question

I am trying to run a query that uses the EXISTS clause:

select <...>    
  from A, B, C
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  exists (select A.ID from <subquery 1>) or 
  exists (select A.ID from <subquery 2>)

Unfortunately, this does not seem to be supported. I have also tried replacing the EXISTS clause with an IN clause:

select <...>    
  from A, B, C
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID in (select ID from ...) or
  A.ID in (select ID from ...)

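Unfortunately, the IN clause also seems to be unsupported.

Any ideas on how I can write a SQL query that achieves the desired result? In principle I could model the WHERE clause as another JOIN and the second OR clause as a UNION, but that seems very clumsy.

EDIT: Listing a number of possible solutions.

Solution 1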

select <...>
  from A, B, C,
       (select ID from ...) as exist_clause_1,
       (select ID from ...) as exist_clause_2
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = exist_clause_1.ID or
  A.ID = exist_clause_2.ID

Solution 2

select <...>
  from A, B, C,
       ( (select ID from ...) UNION
         (select ID from ...)
        ) as exist_clause
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = exist_clause.ID

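Answer

Spark SQL does not currently support EXISTS and IN subqueries. See "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features".

EXISTS and IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION. AND NOT can be rewritten using EXCEPT.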
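The three rewrite rules above can be checked on a toy dataset. The following is only a sketch under stated assumptions: SQLite (through Python's standard `sqlite3` module) stands in for Spark SQL so that it runs anywhere, and the tables `A`, `S1`, `S2` and their rows are hypothetical. SQLite has no LEFT SEMI JOIN, so the semi-join is emulated by joining against the de-duplicated subquery. Because UNION and EXCEPT remove duplicate rows, the comparison relies on `A.ID` being unique.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A  (ID INTEGER);
    CREATE TABLE S1 (ID INTEGER);
    CREATE TABLE S2 (ID INTEGER);
    INSERT INTO A  VALUES (1), (2), (3), (4);
    INSERT INTO S1 VALUES (1), (1), (2);   -- the duplicate 1 is deliberate
    INSERT INTO S2 VALUES (2), (3);
""")

def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Semi-join emulation: join against the DISTINCT ids so that each row of A
# appears at most once. In Spark SQL this could be written as
#   SELECT A.ID FROM A LEFT SEMI JOIN S1 ON A.ID = S1.ID
SEMI_S1 = "SELECT A.ID FROM A JOIN (SELECT DISTINCT ID FROM S1) AS s ON A.ID = s.ID"
SEMI_S2 = "SELECT A.ID FROM A JOIN (SELECT DISTINCT ID FROM S2) AS s ON A.ID = s.ID"

# Rule 1: IN (or EXISTS) versus a join.
in_rows = ids("SELECT ID FROM A WHERE ID IN (SELECT ID FROM S1)")
join_rows = ids(SEMI_S1)

# Rule 2: OR versus UNION.
or_rows = ids(
    "SELECT ID FROM A WHERE ID IN (SELECT ID FROM S1) OR ID IN (SELECT ID FROM S2)"
)
union_rows = ids(SEMI_S1 + " UNION " + SEMI_S2)

# Rule 3: AND NOT versus EXCEPT.
and_not_rows = ids(
    "SELECT ID FROM A WHERE ID IN (SELECT ID FROM S1) AND ID NOT IN (SELECT ID FROM S2)"
)
except_rows = ids(SEMI_S1 + " EXCEPT " + SEMI_S2)

print(in_rows, join_rows)          # [1, 2] [1, 2]
print(or_rows, union_rows)         # [1, 2, 3] [1, 2, 3]
print(and_not_rows, except_rows)   # [1] [1]
```

Without the `SELECT DISTINCT` in the joined subquery, the plain join would return ID 1 twice, which is the difference between an ordinary join and a semi-join.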

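A table holds the rows that make some predicate (a statement parameterized by column names) true:

The DBA gives the predicate for each base table T with columns T.C,...: T(T.C,...)

A JOIN holds the rows that make the AND of its arguments' predicates true; for a UNION, the OR; for an EXCEPT, the AND NOT.

SELECT DISTINCT kept columns FROM T holds the rows where EXISTS dropped columns [predicate of T].

T LEFT SEMI JOIN U holds the rows where EXISTS U-only columns [predicate of T AND predicate of U].

T WHERE condition holds the rows where predicate of T AND condition.

(On querying in general, see this answer.)

So by keeping in mind the predicate expressions that correspond to SQL, you can use straightforward logic rewrite rules to compose and/or reorganize queries. For example, using UNION here need not be "clumsy", either in terms of readability or in terms of execution.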

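Your original question indicated that you understood that you could use UNION, and you have edited variants into your question that excise EXISTS and IN from your original queries. Here is another variant that also excises OR.

select <...>
  from A, B, C,
       (select ID from ...) as e
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = e.ID
union
select <...>
  from A, B, C,
       (select ID from ...) as e
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = e.ID
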
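As a rough check that a UNION of two joined SELECTs returns the same rows as the IN-based query, here is a small sketch. It makes several assumptions: SQLite (via Python's `sqlite3`) stands in for Spark SQL, the column layout and rows of `A`, `B`, `C` are hypothetical, and `S1` and `S2` are hypothetical stand-ins for the two elided subqueries. It also assumes the intended reading of the original WHERE clause is the one with parentheses around the two IN conditions, since AND binds more tightly than OR in SQL.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A  (ID INTEGER, FK_1 INTEGER, FK_2 INTEGER);
    CREATE TABLE B  (PK INTEGER);
    CREATE TABLE C  (PK INTEGER);
    CREATE TABLE S1 (ID INTEGER);   -- stands in for the first subquery
    CREATE TABLE S2 (ID INTEGER);   -- stands in for the second subquery
    INSERT INTO A  VALUES (1, 10, 100), (2, 10, 100), (3, 10, 100), (4, 99, 100);
    INSERT INTO B  VALUES (10);
    INSERT INTO C  VALUES (100);
    INSERT INTO S1 VALUES (1), (4);
    INSERT INTO S2 VALUES (2);
""")

def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# IN-based query, with the OR explicitly parenthesized.
in_query = """
    SELECT A.ID FROM A, B, C
    WHERE A.FK_1 = B.PK AND A.FK_2 = C.PK
      AND (A.ID IN (SELECT ID FROM S1) OR A.ID IN (SELECT ID FROM S2))
"""

# UNION variant: one branch per subquery, with the join conditions repeated.
union_query = """
    SELECT A.ID FROM A, B, C, (SELECT ID FROM S1) AS e
    WHERE A.FK_1 = B.PK AND A.FK_2 = C.PK AND A.ID = e.ID
    UNION
    SELECT A.ID FROM A, B, C, (SELECT ID FROM S2) AS e
    WHERE A.FK_1 = B.PK AND A.FK_2 = C.PK AND A.ID = e.ID
"""

in_rows = ids(in_query)
union_rows = ids(union_query)
print(in_rows)     # [1, 2]
print(union_rows)  # [1, 2]
```

Row 4 appears in `S1` but is dropped by both queries because its `FK_1` has no match in `B`. Note that UNION also removes duplicate rows, so the two forms agree here because each selected row is distinct.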
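Your Solution 1 does not do what you think it does. If just one of the exist_clause tables is empty, then even if there are ID matches available in the other, the FROM cross product of the tables is empty and no rows are returned. (See "An Unintuitive Consequence of SQL Semantics", a sidebar on page 264 of Chapter 6, "The Database Language SQL", in Database Systems: The Complete Book, 2nd Edition.) A FROM does not just introduce names for rows of tables; it CROSS JOINs and/or OUTER JOINs them, after which ON (for INNER JOINs) and WHERE filter some rows out.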
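The empty-cross-product effect is easy to reproduce. The sketch below again assumes SQLite (via Python's `sqlite3`) as a stand-in for Spark SQL; the tables `A`, `S1`, `S2` and their contents are hypothetical, with `S1` and `S2` playing the role of the two derived tables in Solution 1.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A  (ID INTEGER);
    CREATE TABLE S1 (ID INTEGER);   -- left empty on purpose
    CREATE TABLE S2 (ID INTEGER);
    INSERT INTO A  VALUES (1), (2), (3);
    INSERT INTO S2 VALUES (1);
""")

def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Shape of Solution 1: both derived tables are cross joined in FROM.
solution_1 = """
    SELECT A.ID
    FROM A, (SELECT ID FROM S1) AS e1, (SELECT ID FROM S2) AS e2
    WHERE A.ID = e1.ID OR A.ID = e2.ID
"""

# The semantics that were actually wanted.
wanted = """
    SELECT ID FROM A
    WHERE ID IN (SELECT ID FROM S1) OR ID IN (SELECT ID FROM S2)
"""

empty_case = ids(solution_1)   # A x (empty) x S2 has no rows at all
wanted_rows = ids(wanted)
print(empty_case)    # []
print(wanted_rows)   # [1]

# Adding a row to S1 that matches nothing in A makes the match reappear,
# which shows the result depended on S1 being non-empty, not on its content.
conn.execute("INSERT INTO S1 VALUES (99)")
non_empty_case = ids(solution_1)
print(non_empty_case)  # [1]
```

When both derived tables are non-empty, the cross join can also repeat a matching row once per row of the other derived table, so Solution 1 would generally need a DISTINCT as well.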

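Performance typically differs between different expressions that return the same rows, and it depends on DBMS optimization. Many details affect the best way to evaluate a query and the best way to write it; the DBMS and/or the programmer may or may not know them, and may or may not balance them well. But executing two ORed subselects per row in a WHERE (as in your original queries and also your later Solution 2) is not necessarily better than running one UNION of two SELECTs (as in my query).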