使用非不同索引上的递归cte计算不同的行 [英] Counting distinct rows using recursive cte over non-distinct index

查看:157
本文介绍了使用非不同索引上的递归cte计算不同的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定以下模式:

CREATE TABLE identifiers (
  id TEXT PRIMARY KEY
);

CREATE TABLE days (
  day DATE PRIMARY KEY
);

CREATE TABLE data (
  id TEXT REFERENCES identifiers
  , day DATE REFERENCES days
  , values NUMERIC[] 
); 
CREATE INDEX ON data (id, day);

计算两个时间戳之间的所有不同日期的最佳方法是什么?我尝试了以下两种方法:

What is the best way to count all distinct days between two timestamps? I've tried the following two methods:

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT day) 
FROM data 
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=200331.32..200331.33 rows=1 width=4) (actual time=1647.574..1647.575 rows=1 loops=1)
   ->  Index Only Scan using data_day_sid_idx on data  (cost=0.56..196942.12 rows=1355678 width=4) (actual time=0.348..1180.566 rows=1362532 loops=1)
         Index Cond: ((day >= '2010-01-01'::date) AND (day <= '2011-01-01'::date))
         Heap Fetches: 0
 Total runtime: 1647.865 ms
(5 rows)

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT day) 
FROM days
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=18.95..18.96 rows=1 width=4) (actual time=0.481..0.481 rows=1 loops=1)
   ->  Index Only Scan using days_pkey on days  (cost=0.28..18.32 rows=252 width=4) (actual time=0.093..0.275 rows=252 loops=1)
         Index Cond: ((day >= '2010-01-01'::date) AND (day <= '2011-01-01'::date))
         Heap Fetches: 252
 Total runtime: 0.582 ms
(5 rows)

COUNT(DISTINCT天) days 运行良好,但它要求我保留一个辅助表( days )以保持性能合理。在一般意义上,我想测试一个递归cte是否允许我实现类似的性能而不维护一个辅助表。我的查询看起来像这样,但尚未运行:

The COUNT(DISTINCT day) against days runs well, but it requires me to keep a secondary table (days) to keep the performance reasonable. In a general sense, I'd like to test if a recursive cte will allow me to achieve similar performance without maintaining a secondary table. My query looks like this, but doesn't run yet:

EXPLAIN ANALYZE
WITH RECURSIVE cte AS (
   (SELECT day FROM data ORDER BY 1 LIMIT 1)
   UNION ALL
   (  -- parentheses required
   SELECT d.day
   FROM   cte  c
   JOIN   data d ON d.day > c.day
   ORDER  BY 1 LIMIT 1
   )
)
SELECT day 
FROM cte
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';

更新

感谢大家的想法。看起来像维护一个基于触发器的不同天的表是最好的方式,存储和性能。感谢@Erwin的更新,递归CTE重新运行。很有用。

Thanks to everyone for the ideas. Looks like maintaining a trigger-based table of distinct days is the best way to go, both storage and performance-wise. Thanks to @Erwin's update, the recursive CTE is back in the running. Very useful.

WITH RECURSIVE cte AS (
   (  -- parentheses required because of LIMIT
   SELECT day
   FROM   data
   WHERE  day >= '2010-01-01'::date  -- exclude irrelevant rows early
   ORDER  BY 1
   LIMIT  1
   )

   UNION ALL
   SELECT (SELECT day FROM data
           WHERE  day > c.day
           AND    day < '2011-01-01'::date  -- see comments below
           ORDER  BY 1
           LIMIT  1)
   FROM   cte c
   WHERE  day IS NOT NULL  -- necessary because corr. subq. always returns row
   )
SELECT count(*) AS ct
FROM   cte
WHERE  day IS NOT NULL;

                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=53.35..53.36 rows=1 width=0) (actual time=18.217..18.217 rows=1 loops=1)
   CTE cte
     ->  Recursive Union  (cost=0.43..51.08 rows=101 width=4) (actual time=0.194..17.594 rows=253 loops=1)
           ->  Limit  (cost=0.43..0.46 rows=1 width=4) (actual time=0.191..0.192 rows=1 loops=1)
                 ->  Index Only Scan using data_day_idx on data data_1  (cost=0.43..235042.00 rows=8255861 width=4) (actual time=0.189..0.189 rows=1 loops=1)
                       Index Cond: (day >= '2010-01-01'::date)
                       Heap Fetches: 0
           ->  WorkTable Scan on cte c  (cost=0.00..4.86 rows=10 width=4) (actual time=0.066..0.066 rows=1 loops=253)
                 Filter: (day IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.43..0.47 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=252)
                         ->  Index Only Scan using data_day_idx on data  (cost=0.43..1625.59 rows=52458 width=4) (actual time=0.060..0.060 rows=1 loops=252)
                               Index Cond: ((day > c.day) AND (day < '2011-01-01'::date))
                               Heap Fetches: 0
   ->  CTE Scan on cte  (cost=0.00..2.02 rows=100 width=0) (actual time=0.199..18.066 rows=252 loops=1)
         Filter: (day IS NOT NULL)
         Rows Removed by Filter: 1
 Total runtime: 19.355 ms
(19 rows)

并且还讨论了 EXISTS 查询

EXPLAIN ANALYZE
SELECT count(*) AS ct
FROM   generate_series('2010-01-01'::date, '2010-12-31'::date, '1d'::interval) d(day)
WHERE  EXISTS (SELECT 1 FROM data WHERE day = d.day::date);

                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=674.32..674.33 rows=1 width=0) (actual time=95.049..95.049 rows=1 loops=1)
   ->  Nested Loop Semi Join  (cost=0.45..673.07 rows=500 width=0) (actual time=12.438..94.749 rows=252 loops=1)
         ->  Function Scan on generate_series d  (cost=0.01..10.01 rows=1000 width=8) (actual time=9.248..9.669 rows=365 loops=1)
         ->  Index Only Scan using data_day_idx on data  (cost=0.44..189.62 rows=6023 width=4) (actual time=0.227..0.227 rows=1 loops=365)
               Index Cond: (day = (d.day)::date)
               Heap Fetches: 0
 Total runtime: 95.620 ms
(7 rows)


推荐答案

几个注释:

SELECT COUNT(DISTINCT day) 
FROM   days
WHERE  day BETWEEN '2010-01-01' AND '2011-01-01';

day 定义为PK,因此 DISTINCT 在这种情况下只是一种浪费。

day is defined as PK, so DISTINCT would just be a waste in this case.

如果有表, 。该技术支付如果每天有多行,因此相当于松散索引扫描实际上比基表上的简单 DISTINCT 更快:

WITH RECURSIVE cte AS (
   (  -- parentheses required because of LIMIT
   SELECT day
   FROM   data
   WHERE  day >= '2010-01-01'::date  -- exclude irrelevant rows early
   ORDER  BY 1
   LIMIT  1
   )

   UNION ALL
   SELECT (SELECT day FROM data
           WHERE  day > c.day
           AND    day < '2011-01-01'::date  -- see comments below
           ORDER  BY 1
           LIMIT  1)
   FROM   cte c
   WHERE  day IS NOT NULL  -- necessary because corr. subq. always returns row
   )
SELECT count(*) AS ct
FROM   cte
WHERE  day IS NOT NULL;



索引



仅与数据上的匹配索引结合才有意义:

Index

Only makes sense in combination with a matching index on data:

CREATE INDEX data_day_idx ON data (day);

day 您也可以使用(id,day)的问题中的索引,但效率远低于:

day must be the leading column. The index you have in the question on (id, day) can be used too, but is far less efficient:

  • Working of indexes in PostgreSQL
  • Is a composite index also good for queries on the first field?

  • 提前排除不相关的行要便宜得多。

  • It is much cheaper to exclude irrelevant rows early. I integrated your predicate into the query.

详细说明:

手头的案例更简单 - 实际上是最简单的。

The case at hand is even simpler - the simplest possible actually.

您的原始时间范围为天BETWEEN'2010-01-01'AND'2011-01-01' - 这可能不是想要的。 BETWEEN .. AND .. 包括上限和下限,这样您将获得2010年的所有内容加2011-01-01。您似乎希望排除最后一天。所以使用 d.day< '2011-01-01'(不< = )。

Your original time frame was day BETWEEN '2010-01-01' AND '2011-01-01' - which is probably not as intended. BETWEEN .. AND .. includes upper and lower bound, this way you would get all of 2010 plus 2011-01-01. It seems more likely you want to exclude the last day. So using d.day < '2011-01-01' (not <=).

可枚举天的范围(与具有无限数量的可能值的范围相反),您可以使用 EXISTS 半连接:

Since you are testing for a range of enumerable days (as opposed to a range with an infinite number of possible values), you can test this alternative with an EXISTS semi-join:

SELECT count(*) AS ct
FROM   generate_series('2010-01-01'::date, '2010-12-31'::date, '1d'::interval) d(day)
WHERE  EXISTS (SELECT 1 FROM data WHERE day = d.day::date);

同样,简单的索引也是必不可少的。

Again, the same simple index is essential.

SQL Fiddle 演示了两个查询的大测试表格为160k行。

SQL Fiddle demonstrating both queries with a big test table of 160k rows.

这篇关于使用非不同索引上的递归cte计算不同的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆