窗口函数通过当前行过滤 [英] Window functions filter through current row

查看:118
本文介绍了窗口函数通过当前行过滤的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是对这个问题的跟进,我的查询改进为使用窗口函数,而不是 LATERAL 联接内的聚合。虽然查询现在快得多了,但我发现结果不正确。

This is a follow-up to this question, where my query was improved to use window functions instead of aggregates inside a LATERAL join. While the query is now much faster, I've found that the results are not correct.

我需要在x年的跟踪时间范围内进行计算。例如, price_to_maximum_earnings 是通过将十年前的 max(收入)移到当前行来计算的,并且将价格除以结果。为了简单起见,我们在这里使用1年。

I need to perform computations on x year trailing time frames. For example, price_to_maximum_earnings is computed per row by getting max(earnings) over ten years ago to the current row, and dividing price by the result. We'll use 1 year for simplicity here.

SQL小提琴。 (Postgres 9.6)

SQL Fiddle for this question. (Postgres 9.6)

举一个简单的例子,价格 peak_earnings 2010-01-01 的$ c>可以像这样单独计算:

As a simple example, price and peak_earnings for 2010-01-01 could be computed separately like this:

SELECT price
FROM security_data
WHERE date = '2010-01-01'
AND security_id = 'SPX';

SELECT max(earnings) AS min_earnings
FROM bloomberg.security_data
WHERE date >= '2000-01-01'
AND date <= '2010-01-01'
AND security_id = 'SPX';

为此,每行 ,我使用以下代码:

To do this per row, I use the following:

SELECT security_id, date, price
     , CASE WHEN date1 >= min_date
            THEN price / NULLIF(max(earnings) FILTER (WHERE date >= date1) OVER w, 0) END AS price_to_peak_earnings
FROM
(
  SELECT record_id, security_id, price, date, earnings
           , (date - interval '1 y')::date AS date1
           , min(date) OVER (PARTITION BY security_id) AS min_date
      FROM   security_data
) d
WINDOW w AS (PARTITION BY security_id);

我相信这里的问题源于 FILTER ,因为它似乎没有按照我的意愿运行。请注意,在链接的SQL Fiddle中,我显示了 FILTER 的结果,并且对于每行 peak_earnings minimum_earnings 只是整个数据集的最大值和最小值。 应该是从一年前到当前行的收入的最大值/最小值。

I believe the issue here stems from the use of FILTER, as it doesn't seem to be working as I want it to. Note that in the linked SQL Fiddle, I've displayed the result of the FILTER, and for each row the peak_earnings and minimum_earnings are just the max and min for the entire data set. They should be the max/min values of earnings from 1 year ago to the current row.

这是怎么回事?我从这个问题的答案中知道不能简单地说 FILTER(日期> = date1和日期< = current_row.date),所以我找不到解决方案吗?我无法使用窗口框架,因为在给定的时间范围内行数不确定,因此我不能只说(365个行和当前行之间的行)。我可以使用框架过滤器吗?那可能会超过一年,然后筛选器可以捕获每个无效日期。我已经尝试过了,但是没有成功。

What's going on here? I know from the answer to this question that I can't simply say FILTER (WHERE date >= date1 AND date <= current_row.date), so is there a solution I'm missing? I cannot use window frames because I have an uncertain number of rows for any given time frame, so I couldn't just say OVER (ROWS BETWEEN 365 PRECEDING AND CURRENT ROW). Can I use a frame and a filter? That might get over a year previous, then the filter could catch every invalid date. I've tried this but have not been successful.

推荐答案


我可以使用框架和过滤器吗?

Can I use a frame and a filter?

可以。但这两个都有限制:

You can. But either has restrictions:


  • 仅在 FILTER 子句中的表达式看到它获取值的相应行。无法引用窗口函数为其计算值的行。因此,除非我们进行巨大,昂贵的交叉连接,否则我看不到一种根据那个行来构造过滤器的方法-同一行用于许多不同的计算。或者我们回到可以引用父行的 LATERAL 子查询。

  • The expression in the FILTER clause only sees the respective row where it fetches values. There is no way to reference the row for which your window function computes values. So I don't see a way to formulate a filter depending on that row unless we make a huge, expensive cross join - the same row is used for many different computations. Or we are back to LATERAL subqueries that can reference the parent row.

另一方面,帧定义根本不允许变量。如您所引用的相关答案中所述,它需要一个固定的数字:

The frame definition on the other hand does not allow variables at all. It demands a fixed number, as discussed in the related answer you referenced:

  • Referencing current row in FILTER clause of window function

这些限制使您的特定查询难以实施。现在应该是正确

These restrictions make your particular query hard to implement. This should be correct now:

SELECT *
FROM  (
   SELECT record_id, security_id, date, price
        , CASE WHEN do_calc THEN                max(earnings) OVER w1     END AS peak_earnings
        , CASE WHEN do_calc THEN                min(earnings) OVER w1     END AS minimum_earnings
        , CASE WHEN do_calc THEN price / NULLIF(max(earnings) OVER w1, 0) END AS price_to_peak_earnings
        , CASE WHEN do_calc THEN price / NULLIF(min(earnings) OVER w1, 0) END AS price_to_minimum_earnings
   FROM  (
      SELECT *, (date - 365) >= min_date AND s.record_id IS NOT NULL AS do_calc
      FROM  (
         SELECT security_id, min_date
              , generate_series(min_date, max_date, interval '1 day')::date AS date
         FROM  (
            SELECT security_id, min(date) AS min_date, max(date) AS max_date
            FROM   security_data
            GROUP  BY 1
            ) minmax
         ) d
      LEFT   JOIN  security_data s USING (security_id, date)
      ) sub1
   WINDOW w1 AS (PARTITION BY security_id ORDER BY date ROWS BETWEEN 365 PRECEDING AND 1 PRECEDING)
   ) sub2
WHERE  record_id IS NOT NULL 
ORDER  BY 1, 2;

SQL小提琴。


  • 问题中没有任何内容表明每个 security_id 将在同一天有行。计算子查询 minmax 中每个 security_id 的最小/最大日期为我们提供了最小时间范围。

  • Nothing in the question says that every security_id would have rows for the same days. Calculating min / max date per security_id in subquery minmax give us the minimum time frame.

计算时间正好在该行当前日期之前的365天,并且包括当前行( ROWS BETWEEN 365前一个和前一个)。 从聚合中排除/与当前行进行比较通常更为有用。

我将计算条件调整到了相同的时间范围,以避免出现极端情况: (日期-365)> =分钟日期

The time frame for calculations is exactly 365 day preceding the current date of the row and not including the current row (ROWS BETWEEN 365 PRECEDING AND 1 PRECEDING). It's typically more useful to exclude the current row from aggregations to be compared with the current row.
I adapted the condition for calculations to the same time frame to avoid corner case oddities: (date - 365) >= min_date

小提琴,在1月1日的每一行中添加了1行,您可以看到year年的效果与固定数量的365天。 leap年(2001,2005,...)之后,窗口框架为空。

In the fiddle, where you added 1 row for every 1st of Jan, you can see the effect of leapyears contrasting with a fixed number of 365 day. The window frame is empty after leapyears (2001, 2005, ...).

我正在使用所有子查询,通常比CTE快一点。

I am using all subqueries, which is typically a bit faster than CTEs.

可以肯定的是,我们需要在框架定义中包含 ORDER BY 。我更新了与之对应的旧答案,

To be sure, we need to include ORDER BY in the frame definition. I updated my old answer you linked to accordingly:

  • Referencing current row in FILTER clause of window function

在 1年期间,我使用 w1 作为窗口 name 。您可以添加 w2 等,并且每个日期可以有任意天数。毕竟,如果需要,您可以适应leap年。甚至可能根据当前日期生成整个查询...

I use w1 as window name, for the "1 year" period. You might add w2, etc. and can have any number of days for each. You could adapt to leapyears after all if you should need to. Might even generate the whole query depending on the current date ...

这篇关于窗口函数通过当前行过滤的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆