优化LATERAL联接中的慢速聚合 [英] Optimize slow aggregates in LATERAL join
问题描述
在我的PostgreSQL 9.6.2数据库中,我有一个查询,该查询根据一些库存数据构建一个计算字段的表.它为表中的每一行计算1到10年的移动平均窗口,并将其用于周期性调整.特别是CAPE,CAPB,CAPC,CAPS和CAPD.
In my PostgreSQL 9.6.2 database, I've got a query that builds a table of calculated fields from some stock data. It calculates a moving average window of 1 through 10 years for each row in the table, and uses that in cyclical adjustments. CAPE, CAPB, CAPC, CAPS, and CAPD, specifically.
即,对于每一行,您都要计算过去1到10年的avg(earnings)
,然后对其他几个变量进行相同的计算.
I.e., for each row, you calculate avg(earnings)
for the past 1 through 10 years, then do the same for several other variables.
我目前正在使用横向联接来计算每一行的聚合,但是事情非常慢,而且我不确定如何加快速度,是否在建立索引/重写查询等.
I'm currently using a lateral join to compute the aggregates for each row, but things are incredibly slow and I'm not entirely sure how to speed it up, whether it be indexing / rewriting the query, etc.
例如,当我对查询进行分层以仅包含约25,000行时,运行它需要15分钟,这似乎太慢了. (AWS免费套餐上的RDS)
For instance when I stratify the query to only include ~ 25k rows, it takes 15 minutes to run which seems way too slow. (RDS on AWS Free Tier)
-- Initialize Cyclical Adjustments
-- This query populates the database with numerous peak/min and CAPE type calculations.
-- We do this by selecting each valid row within the table by security then laterally
-- selecting the calculations for each of those rows. 'Valid' rows are determined by
-- date calculations that make sure every field that has insufficient data behind it
-- (several queries want 5+ years of time series data) is filled with NULL to avoid
-- inaccuracies.
WITH earliest_point AS (
SELECT
security_id,
min(date) as min_date
FROM bloomberg.security_data
GROUP BY security_id
)
SELECT
rec.record_id,
rec.security_id,
date,
-- Each of these cases decides if we have enough data in the database to populate the field. If there are at least
-- x years in the database (where x = 1:10) we do the price / aggregate computation. Otherwise, we shortcut to NULL.
-- NOTE: The NULLIF prevents us from dividing by zero.
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_earnings, 0) ELSE NULL END AS price_to_peak_earnings,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_earnings, 0) ELSE NULL END AS price_to_minimum_earnings,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_book, 0) ELSE NULL END AS price_to_peak_book,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_book, 0) ELSE NULL END AS price_to_minimum_book,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_sales, 0) ELSE NULL END AS price_to_peak_sales,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_sales, 0) ELSE NULL END AS price_to_minimum_sales,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_cashflow, 0) ELSE NULL END AS price_to_peak_cashflow,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_cashflow, 0) ELSE NULL END AS price_to_minimum_cashflow,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_dividends, 0) ELSE NULL END AS price_to_peak_dividends,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_dividends, 0) ELSE NULL END AS price_to_minimum_dividends,
CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL THEN price / NULLIF(ru.cap1_avg_earnings, 0) ELSE NULL END AS cape1,
CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL THEN price / NULLIF(ru.cap2_avg_earnings, 0) ELSE NULL END AS cape2,
CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL THEN price / NULLIF(ru.cap2_avg_earnings, 0) ELSE NULL END AS cape2,
CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL THEN price / NULLIF(ru.cap3_avg_earnings, 0) ELSE NULL END AS cape3,
CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL THEN price / NULLIF(ru.cap4_avg_earnings, 0) ELSE NULL END AS cape4,
CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL THEN price / NULLIF(ru.cap5_avg_earnings, 0) ELSE NULL END AS cape5,
CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL THEN price / NULLIF(ru.cap6_avg_earnings, 0) ELSE NULL END AS cape6,
CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL THEN price / NULLIF(ru.cap7_avg_earnings, 0) ELSE NULL END AS cape7,
CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL THEN price / NULLIF(ru.cap8_avg_earnings, 0) ELSE NULL END AS cape8,
CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL THEN price / NULLIF(ru.cap9_avg_earnings, 0) ELSE NULL END AS cape9,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_earnings, 0) ELSE NULL END AS cape10,
CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL THEN price / NULLIF(ru.cap1_avg_book, 0) ELSE NULL END AS capb1,
CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL THEN price / NULLIF(ru.cap2_avg_book, 0) ELSE NULL END AS capb2,
CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL THEN price / NULLIF(ru.cap3_avg_book, 0) ELSE NULL END AS capb3,
CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL THEN price / NULLIF(ru.cap4_avg_book, 0) ELSE NULL END AS capb4,
CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL THEN price / NULLIF(ru.cap5_avg_book, 0) ELSE NULL END AS capb5,
CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL THEN price / NULLIF(ru.cap6_avg_book, 0) ELSE NULL END AS capb6,
CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL THEN price / NULLIF(ru.cap7_avg_book, 0) ELSE NULL END AS capb7,
CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL THEN price / NULLIF(ru.cap8_avg_book, 0) ELSE NULL END AS capb8,
CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL THEN price / NULLIF(ru.cap9_avg_book, 0) ELSE NULL END AS capb9,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_book, 0) ELSE NULL END AS capb10,
CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL THEN price / NULLIF(ru.cap1_avg_sales, 0) ELSE NULL END AS caps1,
CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL THEN price / NULLIF(ru.cap2_avg_sales, 0) ELSE NULL END AS caps2,
CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL THEN price / NULLIF(ru.cap3_avg_sales, 0) ELSE NULL END AS caps3,
CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL THEN price / NULLIF(ru.cap4_avg_sales, 0) ELSE NULL END AS caps4,
CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL THEN price / NULLIF(ru.cap5_avg_sales, 0) ELSE NULL END AS caps5,
CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL THEN price / NULLIF(ru.cap6_avg_sales, 0) ELSE NULL END AS caps6,
CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL THEN price / NULLIF(ru.cap7_avg_sales, 0) ELSE NULL END AS caps7,
CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL THEN price / NULLIF(ru.cap8_avg_sales, 0) ELSE NULL END AS caps8,
CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL THEN price / NULLIF(ru.cap9_avg_sales, 0) ELSE NULL END AS caps9,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_sales, 0) ELSE NULL END AS caps10,
CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL THEN price / NULLIF(ru.cap1_avg_cashflow, 0) ELSE NULL END AS capc1,
CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL THEN price / NULLIF(ru.cap2_avg_cashflow, 0) ELSE NULL END AS capc2,
CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL THEN price / NULLIF(ru.cap3_avg_cashflow, 0) ELSE NULL END AS capc3,
CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL THEN price / NULLIF(ru.cap4_avg_cashflow, 0) ELSE NULL END AS capc4,
CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL THEN price / NULLIF(ru.cap5_avg_cashflow, 0) ELSE NULL END AS capc5,
CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL THEN price / NULLIF(ru.cap6_avg_cashflow, 0) ELSE NULL END AS capc6,
CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL THEN price / NULLIF(ru.cap7_avg_cashflow, 0) ELSE NULL END AS capc7,
CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL THEN price / NULLIF(ru.cap8_avg_cashflow, 0) ELSE NULL END AS capc8,
CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL THEN price / NULLIF(ru.cap9_avg_cashflow, 0) ELSE NULL END AS capc9,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_cashflow, 0) ELSE NULL END AS capc10,
CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL THEN price / NULLIF(ru.cap1_avg_dividends, 0) ELSE NULL END AS capd1,
CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL THEN price / NULLIF(ru.cap2_avg_dividends, 0) ELSE NULL END AS capd2,
CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL THEN price / NULLIF(ru.cap3_avg_dividends, 0) ELSE NULL END AS capd3,
CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL THEN price / NULLIF(ru.cap4_avg_dividends, 0) ELSE NULL END AS capd4,
CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL THEN price / NULLIF(ru.cap5_avg_dividends, 0) ELSE NULL END AS capd5,
CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL THEN price / NULLIF(ru.cap6_avg_dividends, 0) ELSE NULL END AS capd6,
CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL THEN price / NULLIF(ru.cap7_avg_dividends, 0) ELSE NULL END AS capd7,
CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL THEN price / NULLIF(ru.cap8_avg_dividends, 0) ELSE NULL END AS capd8,
CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL THEN price / NULLIF(ru.cap9_avg_dividends, 0) ELSE NULL END AS capd9,
CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_dividends, 0) ELSE NULL END AS capd10
FROM bloomberg.security_data rec
-- Include the earliest point we have for this security in the record
JOIN earliest_point ep ON ep.security_id = rec.security_id,
-- LATERAL SELECT is executed for each row in the above query, with the row (rec) as a parameter
LATERAL
(
SELECT
-- Price to Peak/Minimum <field> calculations
max(earnings) AS max_earnings,
min(earnings) AS min_earnings,
max(book) AS max_book,
min(book) AS min_book,
max(sales) AS max_sales,
min(sales) AS min_sales,
max(cashflow) AS max_cashflow,
min(cashflow) AS min_cashflow,
max(dividends) AS max_dividends,
min(dividends) AS min_dividends,
-- Each of the following computes the aggregates for the
-- CAPE/B/S/C/D cyclical adjustments.
avg(earnings) FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_dividends,
avg(earnings) FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_earnings,
avg(book) FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_book,
avg(sales) FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_sales,
avg(cashflow) FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_cashflow,
avg(dividends) FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_dividends,
avg(earnings) AS cap10_avg_earnings,
avg(book) AS cap10_avg_book,
avg(sales) AS cap10_avg_sales,
avg(cashflow) AS cap10_avg_cashflow,
avg(dividends) AS cap10_avg_dividends
FROM bloomberg.security_data DATA
WHERE security_id = rec.security_id
AND date >= rec.date - '10 years'::interval
AND date <= rec.date
) ru;
当我是PostgreSQL的新手时,任何关于如何更快地实现这一想法的想法将不胜感激.
Any ideas on how to make this faster would be greatly appreciated as I'm new to PostgreSQL.
以下是数据库设置供参考:
Here is the database setup for reference:
CREATE SCHEMA bloomberg;
CREATE TABLE bloomberg.securities (
security_id character varying(45) PRIMARY KEY,
name_short character varying(45) NOT NULL,
name character varying(45) NOT NULL,
name_security character varying(45) NOT NULL
);
CREATE TABLE bloomberg.security_data (
record_id bigserial PRIMARY KEY,
date date NOT NULL,
security_id character varying(45) NOT NULL,
price double precision,
total_return double precision,
earnings double precision,
book double precision,
sales double precision,
cashflow double precision,
dividends double precision,
CONSTRAINT security_id FOREIGN KEY (security_id)
REFERENCES bloomberg.securities (security_id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
);
CREATE INDEX security_data_data on bloomberg.security_data (date);
CREATE INDEX security_data_security_id on bloomberg.security_data (security_id);
推荐答案
这应该是带有LATERAL
子查询的更快的变体.未经测试.
This should be a faster variant with LATERAL
subqueries. Untested.
SELECT s.record_id, s.security_id, s.date
, s.price / l.pmax AS price_to_peak_earnings
, s.price / l.pmin AS price_to_minimum_earnings
-- , ...
, s.price / l.cape1 AS cape1
, s.price / l.cape2 AS cape2
-- , ...
, s.price / l.cape10 AS cape10
, s.price / l.capb1 AS capb1
, s.price / l.capb2 AS capb2
-- , ...
, s.price / l.capb10 AS capb10
-- , ...
FROM (
SELECT *
, (date - interval '1 y')::date AS date1
, (date - interval '2 y')::date AS date2
-- ...
, (date - interval '10 y')::date AS date10
FROM (
SELECT *, min(date) OVER (PARTITION BY security_id) AS min_date
FROM security_data
) s1
) s
LEFT JOIN LATERAL (
SELECT CASE WHEN s.date10 >= s.min_date THEN NULLIF(max(earnings) , 0) END AS pmax
, CASE WHEN s.date10 >= s.min_date THEN NULLIF(min(earnings) , 0) END AS pmin
-- ...
, NULLIF(avg(earnings) FILTER (WHERE date >= s.date1), 0) AS cape1 -- no case
, CASE WHEN s.date2 >= s.min_date THEN NULLIF(avg(earnings) FILTER (WHERE date >= s.date2), 0) END AS cape2
-- ...
, CASE WHEN s.date10 >= s.min_date THEN NULLIF(avg(earnings) , 0) END AS cape10 -- no filter
, NULLIF(avg(book) FILTER (WHERE date >= s.date1), 0) AS capb1
, CASE WHEN s.date2 >= s.min_date THEN NULLIF(avg(book) FILTER (WHERE date >= s.date2), 0) END AS capb2
-- ...
, CASE WHEN s.date10 >= s.min_date THEN NULLIF(avg(book) , 0) END AS capb10
-- ...
FROM security_data
WHERE security_id = s.security_id
AND date >= s.date10
AND date < s.date
) l ON s.date1 >= s.min_date -- no computations if < 1 year of trailing data
ORDER BY s.security_id, s.date;
由于每一行都需要多个单独的聚合,因此它的运行速度仍然不会很快.这里的瓶颈将是CPU.
It's still not going to be blazingly fast, since every row needs multiple separate aggregations. The bottleneck here will be CPU.
另请参阅后续方法(加入生成的日历+窗口函数):
Also see the follow up with an alternative approach (JOIN to generated calendar + window functions):
这篇关于优化LATERAL联接中的慢速聚合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!