How to optimize SQL query with window functions

Question

This question is related to this one. I have a table which contains power values for devices, and I need to calculate the power consumption for a given time span and return the 10 most power-consuming devices. I have generated 192 devices and 7742208 measurement records (40324 for each). This is roughly how many records the devices would produce in one month.

For this amount of data my current query takes over 40 seconds to execute, which is too much, because the time span and the number of devices and measurements could be much higher. Should I try to solve this with a different approach than lag() OVER (PARTITION BY ...), and what other optimizations can be made? I would really appreciate suggestions with code examples.

PostgreSQL version 9.4

Query with example values:

SELECT
  t.device_id,
  sum(len_y*(extract(epoch from len_x))) AS total_consumption
FROM (
    SELECT
      m.id,
      m.device_id,
      m.power_total,
      m.created_at,
      m.power_total+lag(m.power_total) OVER (
        PARTITION BY device_id
        ORDER BY m.created_at
      ) AS len_y,
      m.created_at-lag(m.created_at) OVER (
        PARTITION BY device_id
        ORDER BY m.created_at
      ) AS len_x
    FROM
      measurements AS m
  WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamp
    AND '2015-08-27 12:34:59.826837+00'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total_consumption DESC
LIMIT 10;

Table information:

    Column    |           Type           |                         Modifiers
--------------+--------------------------+----------------------------------------------------------
 id           | integer                  | not null default nextval('measurements_id_seq'::regclass)
 created_at   | timestamp with time zone | default timezone('utc'::text, now())
 power_total  | real                     |
 device_id    | integer                  | not null
Indexes:
    "measurements_pkey" PRIMARY KEY, btree (id)
    "measurements_device_id_idx" btree (device_id)
    "measurements_created_at_idx" btree (created_at)
Foreign-key constraints:
    "measurements_device_id_fkey" FOREIGN KEY (device_id) REFERENCES devices(id)

Query plan:

Limit  (cost=1317403.25..1317403.27 rows=10 width=24) (actual time=41077.091..41077.094 rows=10 loops=1)
->  Sort  (cost=1317403.25..1317403.73 rows=192 width=24) (actual time=41077.089..41077.092 rows=10 loops=1)
Sort Key: (sum((((m.power_total + lag(m.power_total) OVER (?))) * date_part('epoch'::text, ((m.created_at - lag(m.created_at) OVER (?)))))))
Sort Method: top-N heapsort  Memory: 25kB
->  GroupAggregate  (cost=1041700.67..1317399.10 rows=192 width=24) (actual time=25361.013..41076.562 rows=192 loops=1)
Group Key: m.device_id
->  WindowAgg  (cost=1041700.67..1201314.44 rows=5804137 width=20) (actual time=25291.797..37839.727 rows=7742208 loops=1)
->  Sort  (cost=1041700.67..1056211.02 rows=5804137 width=20) (actual time=25291.746..30699.993 rows=7742208 loops=1)
Sort Key: m.device_id, m.created_at
Sort Method: external merge  Disk: 257344kB
->  Seq Scan on measurements m  (cost=0.00..151582.05 rows=5804137 width=20) (actual time=0.333..5112.851 rows=7742208 loops=1)
Filter: ((created_at >= '2015-07-30 13:05:24.403552'::timestamp without time zone) AND (created_at <= '2015-08-27 12:34:59.826837'::timestamp without time zone))

Planning time: 0.351 ms
Execution time: 41114.883 ms
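
As an aside (not part of the original question): most of the time in this plan goes into the Sort step that spills to disk ("external merge  Disk: 257344kB"), because the window function needs rows ordered by (device_id, created_at). One common mitigation is a composite index matching that ordering, and/or more work_mem for the session; the index name and work_mem value below are illustrative assumptions, not tested settings:

-- Composite index matching PARTITION BY device_id ORDER BY created_at,
-- so the planner can read rows already sorted instead of sorting ~7.7M rows.
CREATE INDEX measurements_device_id_created_at_idx
    ON measurements (device_id, created_at);

-- Optionally let the sort stay in memory for this session
-- (value is an assumption; tune to available RAM).
SET work_mem = '256MB';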

Query to generate test table and data:

CREATE TABLE measurements (
    id          serial primary key,
    device_id   integer,
    power_total real,
    created_at  timestamp
);

INSERT INTO measurements(
    device_id,
    created_at,
    power_total
  )
SELECT
  device_id,
  now() + (i * interval '1 minute'),
  random()*(50-1)+1
FROM (
  SELECT
    DISTINCT(device_id),
    generate_series(0,10) AS i
 FROM (
  SELECT
    generate_series(1,5) AS device_id
  ) AS dev_ids
) AS gen_table;
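
Note that, as written, the generator above only produces 5 devices with 11 readings each. To approximate the data volume described in the question (192 devices × 40324 readings = 7742208 rows), the series bounds can be widened; a sketch (the cross join of the two generate_series calls is an assumption about how the IDs should be laid out):

-- Scaled-up variant: 192 devices, 40324 one-minute readings each,
-- i.e. the 7742208 rows mentioned in the question.
INSERT INTO measurements (device_id, created_at, power_total)
SELECT
  device_id,
  now() + (i * interval '1 minute'),
  random()*(50-1)+1
FROM
  generate_series(1, 192)   AS device_id,
  generate_series(0, 40323) AS i;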

Answer

I would try to move part of the calculation into the row-insertion phase.

Add a new column:

alter table measurements add consumption real;

Update the column for existing rows:

with m1 as (
    select
        id, power_total, created_at,
        lag(power_total) over (partition by device_id order by created_at) prev_power_total,
        lag(created_at) over (partition by device_id order by created_at) prev_created_at
    from measurements
    )
update measurements m2
set consumption = 
    (m1.power_total+ m1.prev_power_total)*
    extract(epoch from m1.created_at- m1.prev_created_at)
from m1
where m2.id = m1.id;

Create a trigger:

create or replace function before_insert_on_measurements()
returns trigger language plpgsql
as $$
declare
    rec record;
begin
    select power_total, created_at into rec
    from measurements
    where device_id = new.device_id
    order by created_at desc
    limit 1;
    new.consumption:= 
        (new.power_total+ rec.power_total)*
        extract(epoch from new.created_at- rec.created_at);
    return new;
end $$;

create trigger before_insert_on_measurements
before insert on measurements
for each row execute procedure before_insert_on_measurements();
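
A quick way to sanity-check the trigger (the example rows below are hypothetical, not from the original answer): insert two readings for one device a minute apart and confirm the second row's consumption gets filled in.

-- first reading of a device: no previous row, so consumption stays null
insert into measurements (device_id, power_total, created_at)
values (1, 10.0, '2015-08-01 00:00:00');

-- second reading one minute later: the trigger should set
-- consumption = (12.0 + 10.0) * 60 = 1320
insert into measurements (device_id, power_total, created_at)
values (1, 12.0, '2015-08-01 00:01:00');

select device_id, created_at, power_total, consumption
from measurements
where device_id = 1
order by created_at;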

Query:

select device_id, sum(consumption) total_consumption
from measurements
-- where conditions
group by 1
order by total_consumption desc
limit 10;