Method of finding gaps in time series data in MySQL?


Problem description

Let's say we have a database table with two columns, entry_time and value. entry_time is a timestamp, while value can be any other data type. The records are relatively consistent, entered at roughly x-minute intervals. Sometimes, however, no entry is made for several multiples of x, producing a 'gap' in the data.

In terms of efficiency, what is the best way to go about finding these gaps of at least time Y (both new and old) with a query?

Solution

To start with, let us summarize the number of entries by hour in your table.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
       COUNT(*) samplecount
  FROM table
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

Now, if you log something every six minutes (ten times an hour) all your samplecount values should be ten. This expression: CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy but it simply truncates your timestamps to the hour in which they occur by zeroing out the minute and second.
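
As a quick check of what that truncation does, the expression can be run against a literal value. The timestamp below is made up purely for illustration, and the alias hour_bucket is not from the original answer.

```sql
-- Hypothetical literal timestamp, used only to show the truncation.
SELECT CAST(DATE_FORMAT('2024-03-05 14:37:21', '%Y-%m-%d %k:00:00') AS DATETIME) AS hour_bucket;
-- hour_bucket: 2024-03-05 14:00:00
```

Every sample taken between 14:00:00 and 14:59:59 lands in the same bucket, which is what lets GROUP BY count samples per hour.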

This is reasonably efficient, and will get you started. It's very efficient if you can put an index on your entry_time column and restrict your query to, let's say, yesterday's samples as shown here.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
       COUNT(*) samplecount
  FROM table
 WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
   AND entry_time < CURRENT_DATE
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

But it isn't much good at detecting whole hours that go by with missing samples, because an hour with no rows produces no row in the result at all. It's also a little sensitive to jitter in your sampling. That is, if your top-of-the-hour sample is sometimes half a minute early (10:59:30) and sometimes half a minute late (11:00:30), your hourly summary counts will be off. So, this hour summary thing (or day summary, or minute summary, etc.) is not bulletproof.
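
To make the hourly summary point at suspicious hours directly, one option is to keep only the hours whose count falls short. This is a minimal sketch, assuming a hypothetical table name samples and the ten-samples-per-hour rate from the example above; neither is fixed by the original question.

```sql
-- Assumptions: the table is called `samples` (hypothetical name)
-- and a complete hour contains 10 rows (one every six minutes).
SELECT CAST(DATE_FORMAT(entry_time, '%Y-%m-%d %k:00:00') AS DATETIME) AS hour_bucket,
       COUNT(*) AS samplecount
  FROM samples
 WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
   AND entry_time <  CURRENT_DATE
 GROUP BY hour_bucket
HAVING COUNT(*) < 10        -- keep only the hours that look short
 ORDER BY hour_bucket;
```

Hours with no samples at all still do not show up here, since there is nothing to group, so this only narrows down partially filled hours.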

You need a self-join query to get stuff perfectly right; it's a bit more of a hairball and not nearly as efficient.

Let's start by creating ourselves a virtual table (subquery) like this with numbered samples. (This is a pain in MySQL; some other expensive DBMSs make it easier. No matter.)

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
    FROM (
        SELECT entry_time, value
      FROM table
         ORDER BY entry_time
    ) C,
    (SELECT @sample:=0) s

This little virtual table gives entry_num, entry_time, value.
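
If the server happens to be MySQL 8.0 or later, the same numbering can probably be expressed with a window function instead of user variables. This is a different technique from the one in the answer, shown as a sketch against a hypothetical table named samples.

```sql
-- Assumes MySQL 8.0+ (window functions) and a hypothetical table `samples`.
SELECT ROW_NUMBER() OVER (ORDER BY entry_time) AS entry_num,
       entry_time,
       value
  FROM samples;
```

The result has the same three columns (entry_num, entry_time, value), so it could stand in for the virtual table in the steps that follow.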

Next step, we join it to itself.

SELECT one.entry_num, one.entry_time, one.value, 
       TIMEDIFF(two.entry_time, one.entry_time) AS `interval`
  FROM (
     /* virtual table */
  ) ONE
  JOIN (
     /* same virtual table */
  ) TWO ON (TWO.entry_num - 1 = ONE.entry_num)

This lines up the tables next to each other, offset by a single entry, governed by the ON clause of the JOIN.

Finally, we select the rows from this joined result whose interval is larger than your threshold; their entry_time values are the times of the samples right before the missing ones.

The overall self-join query is this. I told you it was a hairball.

SELECT one.entry_num, one.entry_time, one.value, 
       TIMEDIFF(two.entry_time, one.entry_time) AS `interval`
  FROM (
    SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
      FROM (
          SELECT entry_time, value
            FROM table
           ORDER BY entry_time
      ) C,
      (SELECT @sample:=0) s
  ) ONE
  JOIN (
    SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time, c.value
      FROM (
          SELECT entry_time, value
            FROM table
           ORDER BY entry_time
      ) C,
      (SELECT @sample2:=0) s
  ) TWO ON (TWO.entry_num - 1 = ONE.entry_num)
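
The self-join by itself lists every pair of neighbouring samples; the threshold still has to be applied. A minimal sketch of that last filtering step follows, assuming a hypothetical table named samples and a threshold of 10 minutes (600 seconds). It measures the gap with TIMESTAMPDIFF in seconds rather than TIMEDIFF, which makes the comparison a plain number.

```sql
-- Assumptions: hypothetical table `samples`; a gap is anything over 600 seconds.
SELECT one.entry_time AS gap_start,   -- last sample before the gap
       two.entry_time AS gap_end,     -- first sample after the gap
       TIMESTAMPDIFF(SECOND, one.entry_time, two.entry_time) AS gap_seconds
  FROM (
    SELECT @sample:=@sample+1 AS entry_num, c.entry_time
      FROM (SELECT entry_time FROM samples ORDER BY entry_time) c,
           (SELECT @sample:=0) s
  ) one
  JOIN (
    SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time
      FROM (SELECT entry_time FROM samples ORDER BY entry_time) c,
           (SELECT @sample2:=0) s
  ) two ON two.entry_num - 1 = one.entry_num
 WHERE TIMESTAMPDIFF(SECOND, one.entry_time, two.entry_time) > 600
 ORDER BY one.entry_time;
```

Like the query above, this relies on the user-variable counter seeing the rows in entry_time order, so it is worth spot-checking the numbering on your own server version before trusting the output.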

If you have to do this in production on a large table you may want to do it for a subset of your data. For example, you could do it each day for the previous two days' samples. This would be decently efficient, and would also make sure you didn't overlook any missing samples right at midnight. To do this your little rownumbered virtual tables would look like this.

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
    FROM (
        SELECT entry_time, value
          FROM table
         WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
           AND entry_time < CURRENT_DATE /*yesterday but not today*/
         ORDER BY entry_time
    ) C,
    (SELECT @sample:=0) s
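
On MySQL 8.0 or later, the whole self-join can likely be replaced by a single pass with the LEAD() window function, which pairs each sample with the one that follows it. This is an alternative technique, not the one used in the answer; the table name samples and the 600-second threshold are assumptions for illustration.

```sql
-- Assumes MySQL 8.0+, a hypothetical table `samples`, and a 600-second threshold.
SELECT entry_time AS gap_start,
       next_time  AS gap_end,
       TIMESTAMPDIFF(SECOND, entry_time, next_time) AS gap_seconds
  FROM (
    SELECT entry_time,
           LEAD(entry_time) OVER (ORDER BY entry_time) AS next_time
      FROM samples
     WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
       AND entry_time <  CURRENT_DATE
  ) t
 WHERE TIMESTAMPDIFF(SECOND, entry_time, next_time) > 600
 ORDER BY entry_time;
```

The last sample in the window has no successor, so its next_time is NULL and the WHERE clause drops it. With an index on entry_time, the two-day range restriction keeps the scan small.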
