Spark replacement for EXISTS and IN


Problem description

I am trying to run a query that uses the EXISTS clause:

select <...>    
  from A, B, C
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  exists (select A.ID from <subquery 1>) or 
  exists (select A.ID from <subquery 2>) 

Unfortunately, this does not seem to be supported. I have also tried replacing the EXISTS clause with an IN clause:

select <...>    
  from A, B, C
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID in (select ID from ...) or
  A.ID in (select ID from ...)

Unfortunately, the IN clause also seems to be unsupported.

Any ideas on how I can write a SQL query that achieves the desired result? In principle I could model the WHERE clause as another JOIN and the second OR clause as a UNION, but that seems super clumsy.

Some possible solutions are listed below.

Solution 1

select <...>    
  from A, B, C,
       (select ID from ...) as exist_clause_1,
       (select ID from ...) as exist_clause_2
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = exist_clause_1.ID or
  A.ID = exist_clause_2.ID

Solution 2

select <...>    
  from A, B, C,
       ( (select ID from ...) UNION
         (select ID from ...)
        ) as exist_clause
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = exist_clause.ID

Recommended answer

SparkSQL doesn't currently have EXISTS & IN; see "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features".

EXISTS & IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION. AND NOT can be rewritten using EXCEPT.

A table holds the rows that make some predicate (statement parameterized by column names) true:

  • The DBA gives the predicates for each base table T with columns T.C,...: T(T.C,...)
  • A JOIN holds the rows that make the AND of its arguments' predicates true; for a UNION, the OR; for an EXCEPT, the AND NOT.
  • SELECT DISTINCT kept-columns FROM T holds the rows where EXISTS dropped-columns [predicate of T].
  • T LEFT SEMI JOIN U holds the rows where EXISTS U-only-columns [predicate of T AND predicate of U].
  • T WHERE condition holds the rows where predicate of T AND condition.

(On querying in general, see this answer.)
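
The semi-join rule in the list above can be checked on a toy example. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in engine (not Spark), and the tables T and U with their columns are made up for illustration. SQLite has no LEFT SEMI JOIN keyword, so the semi-join is written with EXISTS and compared against the "join, drop the U-only columns, deduplicate" form, which returns the same rows as long as T itself holds no duplicate rows.

```python
import sqlite3

# Hypothetical tables T and U; SQLite stands in for the SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (ID INTEGER, NAME TEXT);
    CREATE TABLE U (ID INTEGER, TAG TEXT);
    INSERT INTO T VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- ID 1 matches twice in U, ID 3 once, ID 2 never.
    INSERT INTO U VALUES (1, 'x'), (1, 'y'), (3, 'z');
""")

# Semi-join written with EXISTS: keep the T rows for which some U row matches.
with_exists = conn.execute("""
    SELECT T.ID, T.NAME FROM T
    WHERE EXISTS (SELECT 1 FROM U WHERE U.ID = T.ID)
    ORDER BY T.ID
""").fetchall()

# Same rows via a join: project away the U-only columns, then deduplicate.
with_join = conn.execute("""
    SELECT DISTINCT T.ID, T.NAME FROM T JOIN U ON U.ID = T.ID
    ORDER BY T.ID
""").fetchall()

print(with_exists)
print(with_join)
```

Without the DISTINCT, the row for ID 1 would come back twice, once per matching U row; that duplication is exactly what a semi-join avoids.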

So by keeping in mind the predicate expressions corresponding to SQL, you can use straightforward logic rewrite rules to compose and/or reorganize queries. E.g., using UNION here need not be "clumsy" either in terms of readability or execution.

Your original question indicated that you understood that you could use UNION and you have edited variants into your question that excise EXISTS and IN from your original queries. Here is another variant also excising OR.

    select <...>    
    from A, B, C, (select ID from ...) as e
    where
      A.FK_1 = B.PK and
      A.FK_2 = C.PK and
      A.ID = e.ID
union
    select <...>    
    from A, B, C, (select ID from ...) as e
    where
      A.FK_1 = B.PK and
      A.FK_2 = C.PK and
      A.ID = e.ID
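
A small, hedged check of this UNION rewrite, again using Python's sqlite3 module as a stand-in engine rather than Spark. The tables S1 and S2 are hypothetical placeholders for the two elided subqueries, table C is left out to keep the sketch short, and the OR in the reference query is parenthesised on the assumption that the join condition is meant to apply to both alternatives (AND binds tighter than OR).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER PRIMARY KEY, FK_1 INTEGER);
    CREATE TABLE B (PK INTEGER PRIMARY KEY);
    CREATE TABLE S1 (ID INTEGER);  -- hypothetical stand-in for <subquery 1>
    CREATE TABLE S2 (ID INTEGER);  -- hypothetical stand-in for <subquery 2>
    INSERT INTO A VALUES (1, 10), (2, 10), (3, 20), (4, 99);
    INSERT INTO B VALUES (10), (20);
    INSERT INTO S1 VALUES (1), (4);
    INSERT INTO S2 VALUES (1), (3);
""")

# Reference query: two IN subqueries combined with OR.
with_or = conn.execute("""
    SELECT A.ID FROM A, B
    WHERE A.FK_1 = B.PK
      AND (A.ID IN (SELECT ID FROM S1) OR A.ID IN (SELECT ID FROM S2))
    ORDER BY A.ID
""").fetchall()

# Rewrite: one SELECT per alternative, combined with UNION.
with_union = conn.execute("""
    SELECT A.ID FROM A, B, S1 AS e WHERE A.FK_1 = B.PK AND A.ID = e.ID
    UNION
    SELECT A.ID FROM A, B, S2 AS e WHERE A.FK_1 = B.PK AND A.ID = e.ID
    ORDER BY 1
""").fetchall()

print(with_or)
print(with_union)
```

UNION removes duplicates, so ID 1, which appears in both S1 and S2, is returned only once. The two forms agree here because the selected rows are distinct to begin with.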

Your Solution 1 does not do what you think it does. If just one of the exist_clause tables is empty, then even if there are ID matches available in the other, the FROM cross product of the tables is empty and no rows are returned. ("An Unintuitive Consequence of SQL Semantics": Chapter 6 The Database Language SQL sidebar page 264 of Database Systems: The Complete Book 2nd Edition.) A FROM does not just introduce names for rows of tables; it CROSS JOINs and/or OUTER JOINs them, after which ON (for INNER JOINs) and WHERE filter some rows out.
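
The empty-table effect is easy to reproduce. The following is a minimal sketch using Python's sqlite3 module in place of Spark; the tables S1 and S2 are hypothetical stand-ins for the two derived tables in Solution 1, and S2 is deliberately left empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER PRIMARY KEY);
    CREATE TABLE S1 (ID INTEGER);  -- hypothetical result of the first subquery
    CREATE TABLE S2 (ID INTEGER);  -- hypothetical second subquery, left empty
    INSERT INTO A VALUES (1), (2);
    INSERT INTO S1 VALUES (1);
""")

# Solution 1 style: both derived tables sit in the FROM cross product.
solution_1 = conn.execute("""
    SELECT A.ID FROM A, S1 AS exist_clause_1, S2 AS exist_clause_2
    WHERE A.ID = exist_clause_1.ID OR A.ID = exist_clause_2.ID
""").fetchall()

# Intended semantics: a row qualifies if either subquery contains its ID.
intended = conn.execute("""
    SELECT A.ID FROM A
    WHERE A.ID IN (SELECT ID FROM S1) OR A.ID IN (SELECT ID FROM S2)
""").fetchall()

print(solution_1)
print(intended)
```

Because S2 has no rows, the cross product A x S1 x S2 has no rows either, so the WHERE clause never gets a chance to accept the match on ID 1.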

Performance is typically different for different expressions returning the same rows. This depends on DBMS optimization. Many details affect the best way to evaluate a query and the best way to write it; the DBMS and/or the programmer may or may not be able to know them, may or may not actually know them, and may or may not balance them well. But executing two ORed subselects per row in a WHERE (as in your original queries, and also in your later Solution 2) is not necessarily better than running one UNION of two SELECTs (as in my query).
