How to get date_part query to hit index?


Problem description

I have yet to be able to get this query to hit an index instead of performing a full scan - I have another query that uses date_part('day', datelocal) against an almost identical table (that table just has a bit less data but same structure) and that one will hit the index I created on the datelocal column (which is a timestamp without timezone). Query (this one performs a parallel seq scan on the table and does a memory quicksort):

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)

Here is the other one that does hit my datelocal index:

SELECT
    date_part('day', datelocal) AS day,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpressionday
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_trunc('day', datelocal), date_part('day', datelocal)
ORDER BY date_trunc('day', datelocal)

Banging my head about this! Any ideas as to how I can speed up the first one or at least make it hit an index? I've tried creating an index on the datelocal field, a compound index on datelocal, gender, and views, and an expression index on date_part('hour', datelocal) but none of that has worked.

Schema:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);



-- Table Definition ----------------------------------------------

CREATE TABLE reportimpressionday (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpressionday_datelocal_index ON reportimpressionday(datelocal timestamp_ops);
CREATE INDEX reportimpressionday_detail_index ON reportimpressionday(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);

Explain (analyze, buffers) output:

Finalize GroupAggregate  (cost=999842.42..999859.67 rows=3137 width=24) (actual time=43754.700..43754.714 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  Buffers: shared hit=123912 read=823290
  I/O Timings: read=81228.280
  ->  Sort  (cost=999842.42..999843.99 rows=3137 width=24) (actual time=43754.695..43754.698 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        Buffers: shared hit=123912 read=823290
        I/O Timings: read=81228.280
        ->  Gather  (cost=999481.30..999805.98 rows=3137 width=24) (actual time=43754.520..43777.558 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              Buffers: shared hit=123912 read=823290
              I/O Timings: read=81228.280
              ->  Partial HashAggregate  (cost=998481.30..998492.28 rows=3137 width=24) (actual time=43751.649..43751.672 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    Buffers: shared hit=123912 read=823290
                    I/O Timings: read=81228.280
                    ->  Parallel Seq Scan on reportimpression  (cost=0.00..991555.98 rows=2770129 width=17) (actual time=13.097..42974.126 rows=2338145 loops=2)
                          Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-02-28 00:00:00'::timestamp without time zone))
                          Rows Removed by Filter: 6792750
                          Buffers: shared hit=123912 read=823290
                          I/O Timings: read=81228.280
Planning time: 0.185 ms
Execution time: 43777.701 ms


Recommended answer

Well, both your queries are on different tables (reportimpression vs. reportimpressionday), so the comparison of the two queries really isn't a comparison. Did you ANALYZE both? Various column statistics also may play a role. Index or table bloat may be different. Does a larger part of all rows qualify for Feb 2019? Etc.
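A minimal sketch of that first check, assuming both tables are visible in the current schema: refresh the planner statistics, then compare when each table was last analyzed and how many dead rows it carries. The pg_stat_user_tables view is standard Postgres; reading the dead-row count as a bloat hint is only a rough heuristic.

```sql
-- Refresh planner statistics for both tables.
ANALYZE reportimpression;
ANALYZE reportimpressionday;

-- Compare statistics freshness and dead-row counts (a rough bloat indicator).
SELECT relname
     , n_live_tup
     , n_dead_tup
     , last_analyze
     , last_autoanalyze
     , last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname IN ('reportimpression', 'reportimpressionday');
```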

One shot in the dark, compare the percentages for both tables:

SELECT tbl, round(share * 100 / total, 2) As percentage
FROM  (
   SELECT text 'reportimpression' AS tbl
        , count(*)::numeric AS total
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')::numeric AS share
   FROM  reportimpression

   UNION ALL
   SELECT 'reportimpressionday'
        , count(*)
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')
   FROM  reportimpressionday
  ) sub;

Is the percentage for reportimpression bigger?

Generally, your index reportimpression_datelocal_index on (datelocal) looks good for it, and reportimpression_viewership_index even allows index-only scans if autovacuum beats the write load on the table. (Though impressions & agegroup are just dead freight for this and it would work even better without).
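As a sketch of an index without the dead freight, assuming the hourly query only reads datelocal, gender and views (the index name below is hypothetical): index-only scans also need a reasonably current visibility map, which VACUUM maintains.

```sql
-- Hypothetical index name; covers only the columns the hourly query reads.
CREATE INDEX reportimpression_hourly_idx
    ON reportimpression (datelocal, gender, views);

-- Update the visibility map (and statistics) so index-only scans become possible.
VACUUM (ANALYZE) reportimpression;
```

Whether the planner then picks this index still depends on the share of qualifying rows.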

You got 26.6 percent, and day is 26.4 percent for my query. For such a large percentage, indexes are typically not useful at all. A sequential scan is typically the fastest way. Only index-only scans may still make sense if the underlying table is much bigger. (Or you have severe table bloat, and less bloated indexes, which makes indexes more attractive again.)

Your first query may just be across the tipping point. Try narrowing the time frame until you see index-only scans. You won't see (bitmap) index scans with more than roughly 5 % of all rows qualifying (depends on many factors).
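One way to probe for that tipping point, sketched with an arbitrary two-day window that is an assumption and not taken from the question: run the original query under EXPLAIN with a narrower range and widen it step by step while watching the plan's scan node.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE datelocal >= '2019-02-01'
  AND datelocal <  '2019-02-03'  -- arbitrary narrow window; widen step by step
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal);
```

The range at which the plan flips from an index scan back to a sequential scan gives a feel for where the planner's cost estimates cross over on this table.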

Be that as it may, consider these modified queries:

SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01' -- '2019-02-28'  -- ?
GROUP  BY 1
ORDER  BY 1;

SELECT date_trunc('day', datelocal)                AS day
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpressionday
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01'
GROUP  BY 1
ORDER  BY 1;



Major points

  • When using a localized date format like '2-1-2019', go through to_timestamp() with explicit format specifiers. Otherwise the result depends on locale settings and might break (silently) when called from a session with different settings. Better yet, use ISO date / time formats as demonstrated, which do not depend on locale settings.

  • Looks like you want to include the whole month of February. But your query misses out on the upper bound. For one, February may have 29 days. And datelocal < '2-28-2019' excludes all of Feb 28 as well. Use datelocal < '2019-03-01' instead.

  • It's cheaper to group & sort by the same expression as you have in the SELECT list if you can. So use date_trunc() there, too. Don't use different expressions without need. If you need the datepart in the result, apply it on the grouped expression, like:

      SELECT date_part('day', date_trunc('day', datelocal)) AS day
      ...
      GROUP  BY date_trunc('day', datelocal)
      ORDER  BY date_trunc('day', datelocal);

    A bit more noisy code, but faster (and possibly easier to optimize for the query planner, too).

  • Use the aggregate FILTER clause in Postgres 9.4 or later. It's cleaner and a bit faster. See:

      • How can I simplify this game statistics query?
      • For absolute performance, is SUM faster or COUNT?
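The first two points can be sketched as follows. This is an illustration under assumptions: to_date() stands in for to_timestamp() because only a date is needed here (that substitution is mine, not the answer's), and the upper bound is derived from the start of the month so that leap years need no special handling.

```sql
-- Parse a localized literal with an explicit format instead of relying on session settings.
SELECT to_date('2-1-2019', 'MM-DD-YYYY') AS range_start;

-- Half-open range covering the whole month: start inclusive, next month exclusive.
SELECT count(*)
FROM   reportimpression
WHERE  datelocal >= date '2019-02-01'
AND    datelocal <  date '2019-02-01' + interval '1 month';
```

A half-open interval of this shape also keeps rows from the last day of the month, whatever their time of day.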
