How to force evaluation of subquery before joining / pushing down to foreign server
Problem description
Suppose I want to query a big table with a few WHERE filters. I am using Postgres 11 and a foreign table; the foreign data wrapper (FDW) is clickhouse_fdw. But I am also interested in a general solution.
I can do so as follows:
SELECT id,c1,c2,c3 from big_table where id=3 and c1=2
My FDW is able to do the filtering on the remote foreign data source, ensuring that the above query is quick and doesn't pull down too much data.
The above works the same if I write:
SELECT id,c1,c2,c3 from big_table where id IN (3,4,5) and c1=2
I.e., all of the filtering is sent downstream.
However, if the filtering I'm trying to do is slightly more complex:
SELECT bt.id,bt.c1,bt.c2,bt.c3
from big_table bt
join lookup_table l on bt.id=l.id
where c1=2 and l.x=5
then the query planner decides to filter on c1=2 remotely but apply the other filter locally.
In my use case, calculating which ids have l.x=5 first and then sending those off to be filtered remotely will be much quicker, so I tried to write it the following way:
SELECT id,c1,c2,c3
from big_table
where c1=2
and id IN (select id from lookup_table where x=5)
However, the query planner still decides to perform the second filter locally on ALL of the results from big_table that satisfy c1=2, which is very slow.
Is there some way I can "force" (select id from lookup_table where x=5) to be pre-calculated and sent as part of a remote filter?
Recommended answer
Foreign data wrapper
Typically, joins or any derived tables from subqueries or CTEs are not available on the foreign server and have to be executed locally. I.e., all rows remaining after the simple WHERE clause in your example have to be retrieved and processed locally, like you observed.
If all else fails, you can execute the subquery SELECT id FROM lookup_table WHERE x = 5 yourself and concatenate the results into the query string.
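As a rough sketch of that manual approach: run the lookup locally, then paste the returned ids into the statement sent to the foreign table. The ids 3, 4, 5 below are illustrative placeholders, not real results. The resulting statement has the same IN-list form that the question reports as being filtered remotely.

```sql
-- Step 1: evaluate the lookup locally and collect the ids as one comma-separated list
SELECT string_agg(id::text, ', ') FROM lookup_table WHERE x = 5;

-- Step 2: paste that list into the query against the foreign table
-- (3, 4, 5 are placeholder values standing in for the output of step 1)
SELECT id, c1, c2, c3
FROM   big_table
WHERE  c1 = 2
AND    id IN (3, 4, 5);
```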
More conveniently, you can automate this with dynamic SQL and EXECUTE in a PL/pgSQL function. Like:
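CREATE OR REPLACE FUNCTION my_func(_c1 int, _l_id int)
  RETURNS TABLE(id int, c1 int, c2 int, c3 int) AS
$func$
BEGIN
   -- The lookup is evaluated locally first; its result is passed
   -- to the dynamic statement as a single array parameter ($2).
   RETURN QUERY EXECUTE
   'SELECT id,c1,c2,c3 FROM big_table
    WHERE  c1 = $1
    AND    id = ANY ($2)'
   USING _c1
       , ARRAY(SELECT l.id FROM lookup_table l WHERE l.x = _l_id);  -- _l_id is matched against lookup_table.x
END
$func$  LANGUAGE plpgsql;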
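A call matching the example from the question might look like the following. The first argument is the value for c1. The second is the value compared against lookup_table.x, despite the parameter name _l_id.

```sql
-- Same filters as in the question: c1 = 2 and lookup_table.x = 5
SELECT * FROM my_func(2, 5);
```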
Related:
- Table name as a PostgreSQL function parameter
Or try a search on SO.
Or you might use the meta-command \gexec in psql. See:
- Filter column names from existing table for SQL DDL statement
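A minimal sketch of the \gexec route, assuming id is an integer column so its text form can be pasted into the statement safely. The first query builds the statement text locally, and psql then executes whatever that query returns. The HAVING clause is an addition for this sketch: it suppresses the output row when the lookup finds nothing, which would otherwise produce an invalid empty IN list.

```sql
-- Build the statement with a literal id list, then let psql execute the result
SELECT format(
          'SELECT id, c1, c2, c3 FROM big_table WHERE c1 = 2 AND id IN (%s)'
        , string_agg(id::text, ', ')
       )
FROM   lookup_table
WHERE  x = 5
HAVING count(*) > 0   -- no row, hence no statement, if the lookup is empty
\gexec
```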
Or this might work: (Feedback says it does not work.)
SELECT id,c1,c2,c3
FROM big_table
WHERE c1 = 2
AND id = ANY (ARRAY(SELECT id FROM lookup_table WHERE x = 5));
Testing locally, I get a query plan like this:
Index Scan using big_table_idx on big_table (cost= ...)
Index Cond: (id = ANY ($0))
Filter: (c1 = 2)
InitPlan 1 (returns $0)
-> Seq Scan on lookup_table (cost= ...)
Filter: (x = 5)
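Bold emphasis mine.
The parameter $0 in the plan inspires hope. The generated array might be something Postgres can pass on to be used remotely. I don't see a similar plan with any of your other attempts, or with some more I tried myself.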
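To check whether the array actually reaches the remote side, inspecting the plan for the foreign table is probably the quickest test. A sketch, assuming the wrapper reports the remote statement in verbose output. postgres_fdw prints a Remote SQL line there; whether clickhouse_fdw does the same is an assumption here.

```sql
-- Look at the remote statement in the output:
-- the id condition is only filtered remotely if it shows up there.
EXPLAIN (VERBOSE)
SELECT id, c1, c2, c3
FROM   big_table
WHERE  c1 = 2
AND    id = ANY (ARRAY(SELECT id FROM lookup_table WHERE x = 5));
```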
Related, concerning postgres_fdw:
- postgres_fdw: possible to push data to foreign server for join?
As for a general solution in plain SQL, without a foreign table: that's a different story. Just use a CTE. But I don't expect that to help with the FDW.
WITH cte AS (SELECT id FROM lookup_table WHERE x = 5)
SELECT id,c1,c2,c3
FROM big_table b
JOIN cte USING (id)
WHERE b.c1 = 2;
PostgreSQL 12 changed (improved) behavior, so that CTEs can be inlined like subqueries, given some preconditions. But, quoting the manual:
You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query.
So:
WITH cte AS MATERIALIZED (SELECT id FROM lookup_table WHERE x = 5)
...
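Written out in full, and assuming the elided remainder is simply the CTE query shown earlier, the statement would presumably read as follows (PostgreSQL 12 or later, since older versions do not accept the MATERIALIZED keyword):

```sql
-- MATERIALIZED keeps the CTE from being inlined into the outer query
WITH cte AS MATERIALIZED (SELECT id FROM lookup_table WHERE x = 5)
SELECT id, c1, c2, c3
FROM   big_table b
JOIN   cte USING (id)
WHERE  b.c1 = 2;
```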
Typically, none of this should be necessary if your DB server is configured properly and column statistics are up to date. But there are corner cases with uneven data distribution ...