Aggregate query on 50M+ row table in PostgreSQL


Question

I have a table "event_statistics" with the following definition:

CREATE TABLE public.event_statistics (
    id int4 NOT NULL DEFAULT nextval('event_statistics_id_seq'::regclass),
    client_id int4 NULL,
    session_id int4 NULL,
    action_name text NULL,
    value text NULL,
    product_id int8 NULL,
    product_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    url text NULL,
    url_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    visit int4 NULL DEFAULT 0,
    date_update timestamptz NULL,
CONSTRAINT event_statistics_pkey PRIMARY KEY (id),
CONSTRAINT event_statistics_client_id_session_id_sessions_client_id_id_for
    FOREIGN KEY (client_id, session_id)
    REFERENCES sessions (client_id, id) -- referenced table inferred from the constraint name
    ON DELETE CASCADE ON UPDATE CASCADE
)
WITH (
    OIDS=FALSE
) ;
CREATE INDEX regdate ON public.event_statistics (date_update timestamptz_ops);

and the table "clients":

CREATE TABLE public.clients (
    id int4 NOT NULL DEFAULT nextval('clients_id_seq'::regclass),
    client_name text NULL,
    client_hash text NULL,
CONSTRAINT clients_pkey PRIMARY KEY (id)
)
WITH (
    OIDS=FALSE
) ;
CREATE INDEX clients_client_name_idx ON public.clients (client_name text_ops);

What I need is to get the count of events in the "event_statistics" table for each "action_name" type, over a specific "date_update" range, grouped by "action_name" and a given time step, and all of that for a specific client.

The goal is to provide statistics for all relevant events for each client on their dashboard on our web site, with an option to choose report dates; depending on the length of the interval, the time step in the chart should differ:

  • current day — count per hour;
  • more than 1 day and <= 1 month — count per day;
  • more than 1 month and <= 6 months — count per week;
  • more than 6 months — count per month.
The query I'm currently using (per-day counts over the last week):

SELECT t.date, A.actionName, count(E.id)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS t(date)
CROSS JOIN (VALUES
    ('page_open'),
    ('product_add'),
    ('product_buy'),
    ('product_event'),
    ('product_favourite'),
    ('product_open'),
    ('product_share'),
    ('session_start')) AS A(actionName)
LEFT JOIN (
    SELECT action_name, date_trunc('day', e.date_update) AS dateTime, e.id
    FROM event_statistics AS e
    WHERE e.client_id = (SELECT id FROM clients AS c WHERE c.client_name = 'client name')
      AND (date_update BETWEEN (current_date - interval '1 week') AND now())
) E ON t.date = E.dateTime AND A.actionName = E.action_name
GROUP BY A.actionName, t.date
ORDER BY A.actionName, t.date;

It takes too long, more than 10 seconds, to count events by event type and day for the last week. I need it to do the same faster and for wider periods of time such as weeks, months or years, with a different grouping interval (each hour of the current day, days for a month, then weeks, months).
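One way to support the variable time step without changing the query shape is to let the application pick the date_trunc unit from the requested range and pass it in; a minimal sketch, assuming a :step parameter ('hour', 'day', 'week' or 'month') and :from/:to bounds supplied by the application (these parameter names are illustrative, not part of the original query):

-- Hypothetical parameterized shape of the aggregate; :step, :client_id, :from and :to
-- are placeholders the application would substitute.
SELECT date_trunc(:step, e.date_update) AS bucket,
       e.action_name,
       count(*) AS events
FROM event_statistics AS e
WHERE e.client_id = :client_id
  AND e.date_update BETWEEN :from AND :to
GROUP BY 1, 2
ORDER BY 1, 2;
-- Empty buckets can then be filled in by joining against a matching
-- generate_series(:from, :to, :step_interval), as in the full query above.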

The query plan:

GroupAggregate  (cost=171937.16..188106.84 rows=1600 width=44)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on clients c  (cost=0.00..1.07 rows=1 width=4)
          Filter: (client_name = 'client name'::text)
  ->  Merge Left Join  (cost=171936.08..183784.31 rows=574060 width=44)
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date =(date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40)
              Sort Key: "*VALUES*".column1, t.date
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40)
                    ->  Function Scan on generate_series t (cost=0.02..10.02 rows=1000 width=8)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32)
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32)
        ->  Materialize  (cost=171307.32..171881.38 rows=114812 width=24)
              ->  Sort  (cost=171307.32..171594.35 rows=114812 width=24)
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    ->  Index Scan using regdate on event_statistics e (cost=0.57..159302.49 rows=114812 width=24)
                          Index Cond: ((date_update > (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= now()))
                          Filter: (client_id = $0)

event_statistics"表有超过 5000 万行,它只会随着客户的增加而增长,而记录不会改变.

The "event_statistics" table has more than 50 millions of rows and it will only grow with clients added and the records will not be changed.

I've tried a lot of different query plans and indexes but couldn't reach acceptable speed when aggregating over a wider date range. I've spent a whole week studying different aspects of this problem and ways to solve it on Stack Overflow and in some blogs, but I am still unsure what the best approach is:

  • partitioning by client_id or by date range;
  • pre-aggregating into a separate results table and updating it daily (also unsure how best to do that: a trigger on insert into the raw table, a separately scheduled job, or a materialized view refreshed on demand from the web site) — see the sketch after this list;
  • changing the database design to one schema per client, or applying sharding;
  • changing the server hardware (CPU Intel Xeon E7-4850 2.00GHz, RAM 6GB; it hosts both the web application and the database);
  • using a different, OLAP-capable database for analytics, such as Postgres-XL, or something else?
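As an illustration of the pre-aggregation idea from the second option above, a minimal sketch of a daily rollup table; the table and column names are made up for the example, and since the raw rows never change, each day's bucket only has to be written once (by a scheduled job, a trigger, or a materialized-view refresh):

-- Hypothetical rollup table; names are illustrative, not part of the original schema.
CREATE TABLE event_statistics_daily (
    client_id   int4   NOT NULL,
    action_name text   NOT NULL,
    day         date   NOT NULL,
    event_count bigint NOT NULL,
    PRIMARY KEY (client_id, action_name, day)
);

-- Run once a day (e.g. from cron) to materialize yesterday's counts.
INSERT INTO event_statistics_daily (client_id, action_name, day, event_count)
SELECT client_id, action_name, date_trunc('day', date_update)::date, count(*)
FROM event_statistics
WHERE date_update >= current_date - interval '1 day'
  AND date_update <  current_date
  AND client_id IS NOT NULL AND action_name IS NOT NULL  -- both are nullable in the source table
GROUP BY 1, 2, 3
ON CONFLICT (client_id, action_name, day)
    DO UPDATE SET event_count = EXCLUDED.event_count;

The dashboard queries would then aggregate this much smaller table into days, weeks or months; only the "current day per hour" view would still need to touch the raw table, and only over a range of at most one day.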

I also tried a btree index on event_statistics (client_id asc, action_name asc, date_update asc, id). It was faster with an index-only scan, but still not fast enough, and it is not very good in terms of disk space usage.
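For reference, a covering index matching that description would presumably have been created along these lines (the index name is assumed for the example):

CREATE INDEX event_statistics_client_action_date_idx  -- name assumed for the example
    ON event_statistics (client_id, action_name, date_update, id);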

What would be the best way to solve this problem?

As requested, the output of the explain (analyze, verbose) command:

GroupAggregate  (cost=860934.44..969228.46 rows=1600 width=44) (actual time=52388.678..54671.187 rows=64 loops=1)
  Output: t.date, "*VALUES*".column1, count(e.id)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on public.clients c  (cost=0.00..1.07 rows=1 width=4) (actual time=0.058..0.059 rows=1 loops=1)
          Output: c.id
          Filter: (c.client_name = 'client name'::text)
          Rows Removed by Filter: 5
  ->  Merge Left Join  (cost=860933.36..940229.77 rows=3864215 width=44) (actual time=52388.649..54388.698 rows=799737 loops=1)
        Output: t.date, "*VALUES*".column1, e.id
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40) (actual time=0.190..0.244 rows=64 loops=1)
              Output: t.date, "*VALUES*".column1
              Sort Key: "*VALUES*".column1, t.date
              Sort Method: quicksort  Memory: 30kB
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40) (actual time=0.059..0.080 rows=64 loops=1)
                    Output: t.date, "*VALUES*".column1
                    ->  Function Scan on pg_catalog.generate_series t  (cost=0.02..10.02 rows=1000 width=8) (actual time=0.043..0.043 rows=8 loops=1)
                          Output: t.date
                          Function Call: generate_series(((('now'::cstring)::date - '7 days'::interval))::timestamp with time zone, now(), '1 day'::interval)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32) (actual time=0.002..0.003 rows=8 loops=8)
                          Output: "*VALUES*".column1
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32) (actual time=0.004..0.005 rows=8 loops=1)
                                Output: "*VALUES*".column1
        ->  Materialize  (cost=860304.60..864168.81 rows=772843 width=24) (actual time=52388.441..54053.748 rows=799720 loops=1)
              Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
              ->  Sort  (cost=860304.60..862236.70 rows=772843 width=24) (actual time=52388.432..53703.531 rows=799720 loops=1)
                    Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Method: external merge  Disk: 39080kB
                    ->  Index Scan using regdate on public.event_statistics e  (cost=0.57..753018.26 rows=772843 width=24) (actual time=31.423..44284.363 rows=799720 loops=1)
                          Output: e.id, e.date_update, e.action_name, date_trunc('day'::text, e.date_update)
                          Index Cond: ((e.date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (e.date_update <= now()))
                          Filter: (e.client_id = $0)
                          Rows Removed by Filter: 2983424
Planning time: 7.278 ms
Execution time: 54708.041 ms

Answer

First step: perform the pre-aggregation in the subquery:

EXPLAIN
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday) -- calendar pseudo-table
CROSS JOIN (VALUES
        ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) AS act(action_name)
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) as theday
                , COUNT(DISTINCT es.id ) AS the_count
        FROM event_statistics as es
        WHERE es.client_id = (SELECT c.id FROM clients AS c
                        WHERE c.client_name = 'client name')
        AND (es.date_update BETWEEN (current_date - interval '1 week') AND now())
        GROUP BY 1,2
        ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name,cal.theday
ORDER BY act.action_name,cal.theday
        ;


Next step: put the VALUES into a CTE and refer to it in the aggregate subquery as well. (The gain depends on the number of action names that can be skipped.)

EXPLAIN
WITH act(action_name) AS (VALUES
        ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        )
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
CROSS JOIN act
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) AS theday
                , COUNT(DISTINCT es.id ) AS the_count
        FROM event_statistics AS es
        WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
        AND EXISTS (SELECT * FROM clients cli  WHERE cli.id= es.client_id AND cli.client_name = 'client name')
        AND EXISTS (SELECT * FROM act WHERE act.action_name = es.action_name)
        GROUP BY 1,2
        ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name,cal.theday
ORDER BY act.action_name,cal.theday
        ;


UPDATE: using a physical (temp) table will result in better estimates.

    -- Final attempt: materialize the Cartesian product (timeseries*action_name)
    -- into a temp table
CREATE TEMP TABLE grid AS
(SELECT act.action_name, cal.theday
FROM generate_series(current_date - interval '1 week', now(), interval '1 day')
    AS cal(theday)
CROSS JOIN
    (VALUES ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) act(action_name)
    );
CREATE UNIQUE INDEX ON grid(action_name, theday);

    -- Index will force statistics to be collected
    -- ,and will generate better estimates for the numbers of rows
CREATE INDEX iii ON event_statistics (action_name, date_update ) ;
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT grid.action_name, grid.theday, SUM(sub.the_count) AS the_count
FROM grid
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) AS theday
                , COUNT(*) AS the_count
        FROM event_statistics AS es
        WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
        AND EXISTS (SELECT * FROM clients cli  WHERE cli.id= es.client_id AND cli.client_name = 'client name')
        -- AND EXISTS (SELECT * FROM grid WHERE grid.action_name = es.action_name)
        GROUP BY 1,2
        ORDER BY 1,2 --nonsense!
        ) sub ON grid.theday = sub.theday AND grid.action_name = sub.action_name
GROUP BY grid.action_name,grid.theday
ORDER BY grid.action_name,grid.theday
        ;


Update #3 (sorry, I created the indexes on the base table(s) here, you'll need to edit. I also removed the one-column index on the timestamp.)

    -- attempt #4:
    -- - materialize the Cartesian product (timeseries*action_name)
    -- - sanitize the date interval logic

CREATE TEMP TABLE grid AS
(SELECT act.action_name, cal.theday::date
FROM generate_series(current_date - interval '1 week', now(), interval '1 day')
    AS cal(theday)
CROSS JOIN
    (VALUES ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) act(action_name)
    );

    -- Index will force statistics to be collected
    -- ,and will generate better estimates for the numbers of rows
-- CREATE UNIQUE INDEX ON grid(action_name, theday);
-- CREATE INDEX iii ON event_statistics (action_name, date_update ) ;
CREATE UNIQUE INDEX ON grid(theday, action_name);
CREATE INDEX iii ON event_statistics (date_update, action_name) ;
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT gr.action_name, gr.theday
            , COUNT(*) AS the_count
FROM grid gr
LEFT JOIN event_statistics AS es
    ON es.action_name = gr.action_name
    AND date_trunc('day',es.date_update)::date = gr.theday
    AND es.date_update BETWEEN (current_date - interval '1 week') AND current_date
JOIN clients cli  ON cli.id= es.client_id AND cli.client_name = 'client name'
GROUP BY gr.action_name,gr.theday
ORDER BY 1,2
        ;


                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8.33..8.35 rows=1 width=17)
   Group Key: gr.action_name, gr.theday
   ->  Sort  (cost=8.33..8.34 rows=1 width=17)
         Sort Key: gr.action_name, gr.theday
         ->  Nested Loop  (cost=1.40..8.33 rows=1 width=17)
               ->  Nested Loop  (cost=1.31..7.78 rows=1 width=40)
                     Join Filter: (es.client_id = cli.id)
                     ->  Index Scan using clients_client_name_key on clients cli  (cost=0.09..2.30 rows=1 width=4)
                           Index Cond: (client_name = 'client name'::text)
                     ->  Bitmap Heap Scan on event_statistics es  (cost=1.22..5.45 rows=5 width=44)
                           Recheck Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
                           ->  Bitmap Index Scan on iii  (cost=0.00..1.22 rows=5 width=0)
                                 Index Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
               ->  Index Only Scan using grid_theday_action_name_idx on grid gr  (cost=0.09..0.54 rows=1 width=17)
                     Index Cond: ((theday = (date_trunc('day'::text, es.date_update))::date) AND (action_name = es.action_name))
(15 rows)

