在完全实现视图之前,是否可以回答视图查询? [英] Is it possible to answer queries on a view before fully materializing the view?

查看:88
本文介绍了在完全实现视图之前,是否可以回答视图查询?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

简而言之:在不进行联接的情况下,左联接左侧的Distinct,Min,Max应该是可回答的.

In short: Distinct,Min,Max on the Left hand side of a Left Join, should be answerable without doing the join.

我正在使用SQL数组类型(在Postgres 9.3上)将多行数据压缩为一行,然后使用一个视图返回未嵌套的标准化视图.我这样做是为了节省索引成本,并使Postgres压缩数组中的数据.
事情运行得很好,但是可以在不取消嵌套和实例化/分解视图的情况下回答一些查询,这是很昂贵的,因为它们被推迟到实例化视图之后.有什么办法可以解决这个问题?

I’m using a SQL array type (on Postgres 9.3) to condense several rows of data in to a single row, and then a view to return the unnested normalized view. I do this to save on index costs, as well as to get Postgres to compress the data in the array.
Things work pretty well, but some queries that could be answered without unnesting and materializing/exploding the view are quite expensive because they are deferred till after the view is materialized. Is there any way to solve this?

这是基本表:

CREATE TABLE mt_count_by_day
(
  run_id integer NOT NULL,
  type character varying(64) NOT NULL,
  start_day date NOT NULL,
  end_day date NOT NULL,
  counts bigint[] NOT NULL,
  CONSTRAINT mt_count_by_day_pkey PRIMARY KEY (run_id, type),
)

类型"索引正好可以用来衡量:

An index on ‘type’ just for good measure:

CREATE INDEX runinfo_mt_count_by_day_type_idx on runinfo.mt_count_by_day (type);

这里是使用generate_series和unnest的视图

Here is the view that uses generate_series and unnest

CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
 SELECT mt_count_by_day.run_id,
    mt_count_by_day.type,
    mt_count_by_day.brand,
    generate_series(mt_count_by_day.start_day::timestamp without time zone, mt_count_by_day.end_day - '1 day'::interval, '1 day'::interval) AS row_date,
    unnest(mt_count_by_day.counts) AS row_count
   FROM runinfo.mt_count_by_day;


如果我想在类型"列上做不同的事情怎么办?


What if I want to do distinct on the ‘type' column?

explain analyze select distinct(type) from mt_count_by_day;

"HashAggregate  (cost=9566.81..9577.28 rows=1047 width=19) (actual time=171.653..172.019 rows=1221 loops=1)"
"  ->  Seq Scan on mt_count_by_day  (cost=0.00..9318.25 rows=99425 width=19) (actual time=0.089..99.110 rows=99425 loops=1)"
"Total runtime: 172.338 ms"

现在,如果我在视图上执行相同的操作会怎样?

Now what happens if I do the same on the view?

explain analyze select distinct(type) from v_mt_count_by_day;

"HashAggregate  (cost=1749752.88..1749763.34 rows=1047 width=19) (actual time=58586.934..58587.191 rows=1221 loops=1)"
"  ->  Subquery Scan on v_mt_count_by_day  (cost=0.00..1501190.38 rows=99425000 width=19) (actual time=0.114..37134.349 rows=68299959 loops=1)"
"        ->  Seq Scan on mt_count_by_day  (cost=0.00..506940.38 rows=99425000 width=597) (actual time=0.113..24907.147 rows=68299959 loops=1)"
"Total runtime: 58587.474 ms"

有没有办法让postgres认识到它可以解决此问题而无需先展开视图?

Is there a way to get postgres to recognize that it can solve this without first exploding the view?

在这里,为了进行比较,我们在计算表与视图中符合条件的行数.一切都按预期进行.在实现视图之前,Postgres会过滤掉行.不太一样,但是此属性使我们的数据更易于管理.

Here we can see for comparison we are counting the number of rows matching criteria in the table vs the view. Everything works as expected. Postgres filters down the rows before materializing the view. Not quite the same, but this property is what makes our data more manageable.

explain analyze select count(*) from mt_count_by_day where type = ’SOCIAL_GOOGLE'
"Aggregate  (cost=157.01..157.02 rows=1 width=0) (actual time=0.538..0.538 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on mt_count_by_day  (cost=4.73..156.91 rows=40 width=0) (actual time=0.139..0.509 rows=122 loops=1)"
"        Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"        ->  Bitmap Index Scan on runinfo_mt_count_by_day_type_idx  (cost=0.00..4.72 rows=40 width=0) (actual time=0.098..0.098 rows=122 loops=1)"
"              Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 0.625 ms"

explain analyze select count(*) from v_mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate  (cost=857.11..857.12 rows=1 width=0) (actual time=6.827..6.827 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on mt_count_by_day  (cost=4.73..357.11 rows=40000 width=597) (actual time=0.124..5.294 rows=15916 loops=1)"
"        Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"        ->  Bitmap Index Scan on runinfo_mt_count_by_day_type_idx  (cost=0.00..4.72 rows=40 width=0) (actual time=0.082..0.082 rows=122 loops=1)"
"              Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 6.885 ms"


以下是复制此代码所需的代码:


Here is the code required to reproduce this:

CREATE TABLE base_table
(
  run_id integer NOT NULL,
  type integer NOT NULL,
  start_day date NOT NULL,
  end_day date NOT NULL,
  counts bigint[] NOT NULL
  CONSTRAINT match_check CHECK (end_day > start_day  AND (end_day - start_day) = array_length(counts, 1)),
  CONSTRAINT base_table_pkey PRIMARY KEY (run_id, type)
);

--Just because...
CREATE INDEX base_type_idx on base_table (type);

CREATE OR REPLACE VIEW v_foo AS
SELECT m.run_id,
       m.type,
       t.row_date::date,
       t.row_count
FROM   base_table m
LEFT   JOIN LATERAL ROWS FROM (
          unnest(m.counts),
          generate_series(m.start_day, m.end_day-1, interval '1d')
       ) t(row_count, row_date) ON true;



insert into base_table
select a.run_id, a.type, '20120101'::date as start_day, '20120401'::date as end_day, b.counts  from (SELECT N AS run_id, L as type
FROM
    generate_series(1, 10000) N
CROSS JOIN
    generate_series(1, 7) L
ORDER BY N, L) a,  (SELECT array_agg(generate_series)::bigint[] as counts FROM generate_series(1, 91) ) b

以及9.4.1的结果:

And the results on 9.4.1:

解释分析从base_table中选择不同的类型;

explain analyze select distinct type from base_table;

"HashAggregate  (cost=6750.00..6750.03 rows=3 width=4) (actual time=51.939..51.940 rows=3 loops=1)"
"  Group Key: type"
"  ->  Seq Scan on base_table  (cost=0.00..6600.00 rows=60000 width=4) (actual time=0.030..33.655 rows=60000 loops=1)"
"Planning time: 0.086 ms"
"Execution time: 51.975 ms"

解释分析选择与v_foo不同的类型;

explain analyze select distinct type from v_foo;

"HashAggregate  (cost=1356600.01..1356600.04 rows=3 width=4) (actual time=9215.630..9215.630 rows=3 loops=1)"
"  Group Key: m.type"
"  ->  Nested Loop Left Join  (cost=0.01..1206600.01 rows=60000000 width=4) (actual time=0.112..7834.094 rows=5460000 loops=1)"
"        ->  Seq Scan on base_table m  (cost=0.00..6600.00 rows=60000 width=764) (actual time=0.009..42.694 rows=60000 loops=1)"
"        ->  Function Scan on t  (cost=0.01..10.01 rows=1000 width=0) (actual time=0.091..0.111 rows=91 loops=60000)"
"Planning time: 0.132 ms"
"Execution time: 9215.686 ms"

推荐答案

通常,Postgres查询计划程序使用 内联"视图来优化整个查询. 每个文档:

Generally, the Postgres query planner does "inline" views to optimize the whole query. Per documentation:

重写系统的一个应用是视图的实现. 每当针对视图(即虚拟表)进行查询时, 重写系统将用户查询重写为访问该查询的查询 视图定义中给定的基本表.

One application of the rewrite system is in the realization of views. Whenever a query against a view (i.e., a virtual table) is made, the rewrite system rewrites the user's query to a query that accesses the base tables given in the view definition instead.

但是我认为Postgres不够聪明不能得出结论,它可以在不爆炸行的情况下从基表中获得相同的结果.

But I don't think Postgres is smart enough to conclude that it can reach the same result from the base table without exploding rows.

您可以尝试使用LATERAL连接执行此替代查询.比较干净:

You can try this alternative query with a LATERAL join. It's cleaner:

CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
     , m.start_day + c.rn - 1 AS row_date
     , c.row_count
FROM   runinfo.mt_count_by_day m
LEFT   JOIN LATERAL unnest(m.counts) WITH ORDINALITY c(row_count, rn) ON true;

这也很清楚(end_daystart_day)之一是多余的.

It also makes clear that one of (end_day, start_day) is redundant.

使用LEFT JOIN,因为这可能会使查询计划者忽略查询中的联接:

Using LEFT JOIN because that might allow the query planner to ignore the join from your query:

   SELECT DISTINCT type FROM v_mt_count_by_day;

否则(使用CROSS JOININNER JOIN),它必须评估联接以查看是否消除了第一个表中的行.

Else (with a CROSS JOIN or INNER JOIN) it must evaluate the join to see whether rows from the first table are eliminated.

顺便说一句,是:

SELECT DISTINCT type ...

不是:

SELECT DISTINCT(type) ...

请注意,这将返回date而不是原始文档中的时间戳.更轻松,我想这还是您想要的吗?

Note that this returns a date instead of the timestamp in your original. Easer, and I guess it's what you want anyway?

需要Postgres 9.3 + 详细信息:

Requires Postgres 9.3+ Details:

为了安全地并行爆炸两列:

To explode both columns in parallel safely:

CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
       t.row_date::date, t.row_count
FROM   runinfo.mt_count_by_day m
LEFT   JOIN LATERAL ROWS FROM (
          unnest(m.counts)
        , generate_series(m.start_day, m.end_day, interval '1d')
       ) t(row_count, row_date) ON true;

主要优点:如果两个SRF没有返回相同的行数,则不会破坏笛卡尔乘积.相反,将填充NULL值.

The main benefit: This would not derail into a Cartesian product if the two SRF don't return the same number of rows. Instead, NULL values would be padded.

同样,我不能说这是否可以在没有测试的情况下为DISTINCT type的更快计划提供帮助,从而帮助查询计划者.

Again, I can't say whether this would help the query planner with a faster plan for DISTINCT type without testing.

这篇关于在完全实现视图之前,是否可以回答视图查询?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆