How to calculate a moving average in MySQL in a correlated subquery?


Problem description

I want to create a timeline report that shows, for each date in the timeline, a moving average of the latest N data points in a data set that has some measures and the dates they were measured. I have a calendar table populated with every day to provide the dates. I can calculate a timeline to show the overall average prior to that date fairly simply with a correlated subquery (the real situation is much more complex than this, but it can essentially be simplified to this):

SELECT  c.date
,       (   SELECT  AVG(m.value)
            FROM    measures as m
            WHERE   m.measured_on_dt <= c.date
        ) as `average_to_date`
FROM    calendar c
WHERE   c.date between date1 AND date2  -- graph boundaries
ORDER BY c.date ASC

I've spent days reading around this and I've not found any good solutions. Some have suggested that LIMIT might work in the subquery (LIMIT is supported in subqueries in the current version of MySQL). However, LIMIT applies to the returned set, not to the rows going into the aggregate, so adding it makes no difference.

Nor can I write a non-aggregated SELECT with a LIMIT and then aggregate over that, because a correlated subquery is not allowed inside a FROM clause. So this (sadly) WON'T work:

SELECT  c.date
,       (   SELECT AVG(last_5.value)
            FROM (  SELECT  m.value
                    FROM    measures as m
                    WHERE   m.measured_on_dt <= c.date  -- outer reference inside FROM: rejected
                    ORDER BY m.measured_on_dt DESC
                    LIMIT 5
                  ) as `last_5`
        ) as `recent_5_avg`
FROM    calendar c
WHERE   c.date between date1 AND date2  -- graph boundaries
ORDER BY c.date ASC

I'm thinking I need to avoid the subquery approach completely and see if I can do this with a clever join / row-numbering technique with user variables and then aggregate that. While I'm working on that, I thought I'd ask whether anyone knows a better method.
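For readers on newer servers, one possible alternative is a lateral derived table. This is only a sketch and it assumes MySQL 8.0.14 or later, where a derived table marked LATERAL may reference columns of tables that precede it in the same FROM clause. The table and column names are the ones used in the question, `date1` and `date2` are the same placeholders for the graph boundaries, and the alias `recent_5_avg` is made up for illustration.

```sql
-- Sketch, assuming MySQL 8.0.14+ (LATERAL derived tables).
-- For each calendar date, take the 5 most recent measures on or before it
-- and average them. date1 / date2 are placeholders for the graph boundaries.
SELECT   c.date,
         AVG(last_5.value) AS recent_5_avg   -- NULL when no measure exists yet
FROM     calendar AS c
LEFT JOIN LATERAL (
             SELECT   m.value
             FROM     measures AS m
             WHERE    m.measured_on_dt <= c.date   -- outer reference, allowed by LATERAL
             ORDER BY m.measured_on_dt DESC
             LIMIT    5
         ) AS last_5 ON TRUE
WHERE    c.date BETWEEN date1 AND date2
GROUP BY c.date
ORDER BY c.date ASC;
```

The LEFT JOIN keeps calendar dates that have no earlier measures, so the whole timeline is returned in one pass. On older versions this form is not available and a row-numbering approach is still needed.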

UPDATE: Okay, I've got a solution working, which I've simplified for this example. It relies on some user-variable trickery to number the measures backwards from the calendar date. It also does a cross product with the calendar table (instead of a subquery), but this has the unfortunate side effect of causing the row-numbering trick to fail (user variables are evaluated when the rows are sent to the client, not when each row is evaluated). To work around this, I've had to nest the query one level, order the results, and then apply the row-numbering trick to that ordered set, which then works.

This query only returns calendar dates for which there are measures, so if you wanted the whole timeline you'd simply select the calendar and LEFT JOIN to this result set.

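set @day = 0;    -- calendar date seen on the previous row
set @num = 0;    -- running row number within the current calendar date
set @LIMIT = 5;  -- N: how many recent measures to average

SELECT  full_date
,       AVG(value) as recent_N_AVG
FROM
(  SELECT full_data.*
  -- restart the counter at 1 whenever the calendar date changes
  ,      @num := if(@day = full_date, @num + 1, 1) as day_row_number
  ,      @day := full_date as dummy
  FROM 
  ( -- every calendar date paired with every measure on or before it,
    -- newest measure first within each date
    SELECT  c.full_date
    ,       m.value
    ,       m.measured_on_dt
    FROM    calendar c 
    JOIN    measures as m
    WHERE   m.measured_on_dt <= c.full_date
    AND     c.full_date BETWEEN date1 AND date2  
    ORDER BY c.full_date ASC, measured_on_dt DESC
  ) as full_data
) as numbered
WHERE day_row_number <= @LIMIT
GROUP BY full_date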

The row numbering trick can be generalised to more complex data (my measures are in several dimensions which need aggregating up).
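The same numbering can probably be expressed without user variables on servers that support window functions. The sketch below assumes MySQL 8.0 or later and reuses the table and column names from the query above; `date1` and `date2` remain placeholders, and the literal 5 stands in for `@LIMIT`.

```sql
-- Sketch, assuming MySQL 8.0+ (window functions).
-- ROW_NUMBER() numbers the measures backwards from each calendar date,
-- replacing the @num / @day user-variable bookkeeping.
SELECT   full_date,
         AVG(value) AS recent_n_avg
FROM (
    SELECT c.full_date,
           m.value,
           ROW_NUMBER() OVER (PARTITION BY c.full_date
                              ORDER BY m.measured_on_dt DESC) AS day_row_number
    FROM   calendar AS c
    JOIN   measures AS m
           ON m.measured_on_dt <= c.full_date
    WHERE  c.full_date BETWEEN date1 AND date2
) AS numbered
WHERE    day_row_number <= 5
GROUP BY full_date
ORDER BY full_date;
```

This version does not depend on the evaluation order of user variables, so the extra nesting level with ORDER BY is not needed. If several measures share the same `measured_on_dt`, which of them falls inside the top 5 is not deterministic unless a tie-breaking column is added to the ORDER BY. Extra dimensions can be handled by adding them to PARTITION BY and GROUP BY.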

Solution

If your timeline is continuous (1 value each day) you could improve your first attempt like this:

SELECT c.date,
       ( SELECT AVG(m.value) 
         FROM   measures as m
         WHERE  m.measured_on_dt 
                    -- BETWEEN is inclusive: 4 days back plus c.date itself = 5 days
                    BETWEEN DATE_SUB(c.date, INTERVAL 4 day) AND c.date
       ) as `average_to_date`
FROM    calendar c
WHERE   c.date between date1 AND date2  -- graph boundaries
ORDER BY c.date ASC

If your timeline has holes in it, the window will contain fewer than 5 values, so the average will be computed from fewer data points.
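If the window should always cover the last 5 data points rather than the last 5 days, a row-based window frame is one way to avoid the problem with holes. This is a sketch that assumes MySQL 8.0 or later; the alias `recent_5_avg` is made up for illustration.

```sql
-- Sketch, assuming MySQL 8.0+ (window functions).
-- The frame counts rows, not days, so gaps in the timeline do not shrink it.
SELECT   m.measured_on_dt,
         AVG(m.value) OVER (ORDER BY m.measured_on_dt
                            ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS recent_5_avg
FROM     measures AS m
ORDER BY m.measured_on_dt;
```

This returns one row per measure rather than one row per calendar date, and the first four rows are averaged over fewer than 5 values because not enough history exists yet. Calendar dates without a measure would still have to be filled in separately.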
