Autofill missing rows in a database based on a missing time range


Question

I have a table in a PostgreSQL database that stores a datetime along with an integer, like this:

      dt                total                                                   
--------------------------------                                        
2019-07-01 10:00:00     150                                      
2019-07-01 10:15:00     153                                      
2019-07-01 10:30:00     155                                      
2019-07-01 10:45:00     160                                      
2019-07-01 11:00:00     161                                   
....

As you can see, the datetimes in the dt column are consecutive at 15-minute intervals. My problem is that the incoming data sometimes misses some rows.

For example:

     dt                total                                                   
--------------------------------                                        
2019-07-01 10:00:00     150                                      
2019-07-01 10:15:00     153                                      
2019-07-01 10:30:00     155                                      
2019-07-01 10:45:00     160                                      
2019-07-01 11:00:00     161
2019-07-01 11:15:00     163
2019-07-01 12:00:00     170

In this example there are n = 2 missing rows, for 11:30 and 11:45. I want to autofill the datetime of those rows and fill in each missing total by spreading the difference between the total of the last row before the gap (11:15) and the total of the first row after the gap (12:00) evenly across the missing rows.

For this example, each missing row's total is the previous row's total plus (170 - 163)/(n + 1) = 7/3 = 2.333 (using 3 decimal digits here), so the result becomes:

     dt                total                                                   
--------------------------------                                        
2019-07-01 10:00:00     150                                      
2019-07-01 10:15:00     153                                      
2019-07-01 10:30:00     155                                      
2019-07-01 10:45:00     160                                      
2019-07-01 11:00:00     161
2019-07-01 11:15:00     163
2019-07-01 11:30:00     165.333
2019-07-01 11:45:00     167.666
2019-07-01 12:00:00     170

I don't think this can be done directly in SQL, so I think Python might help here. Any ideas?

Recommended answer

You can use generate_series() and some arithmetic. The following assumes that total is increasing (as in your sample data):

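-- Build the full 15-minute grid, number it, and left join the real rows onto it.
select d.dt, seqnum,
       coalesce(t.total,
                -- prev_value: with increasing totals, the running max is the last known total
                (max(t.total) over (order by d.dt asc) +
                 -- next_value - prev_value
                 (min(t.total) over (order by d.dt desc) -
                  max(t.total) over (order by d.dt asc)
                 ) *
                 -- ratio; the cast avoids integer division when total is an integer column
                 (seqnum - max(seqnum) filter (where t.total is not null) over (order by d.dt asc))::numeric /
                  nullif(min(seqnum) filter (where t.total is not null) over (order by d.dt desc) -
                         max(seqnum) filter (where t.total is not null) over (order by d.dt asc),
                         0
                        )
                 )
                ) as total
from (select dt, count(*) over (order by dt) as seqnum
      from (select generate_series(min(dt), max(dt), interval '15 minute') as dt
            from t
            ) g
     ) d left join
     t
     on t.dt = d.dt
order by d.dt;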

Here is a fiddle.

The calculation is cumbersome only because you need a weighted average to get the in-between values. The formula is:

prev_value + (next_value - prev_value) * ratio

where the ratio is:

(current_time - prev_time) / (next_time - prev_time)

But instead of times, the query uses a sequential count (seqnum) over the 15-minute grid.
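Since the question also asks about Python, the same interpolation can be sketched outside the database. This is only a minimal illustration, not part of the original answer: the function name fill_gaps and the (datetime, total) tuple layout are assumptions, and the rows are assumed to be sorted and to lie on the same 15-minute grid. Unlike the SQL above, it does not need total to be increasing.

```python
from datetime import datetime, timedelta


def fill_gaps(rows, step=timedelta(minutes=15)):
    """Fill missing timestamps by linear interpolation.

    rows: list of (datetime, number) tuples, sorted by datetime and aligned
    to the same `step` grid (hypothetical layout, chosen for illustration).
    """
    filled = []
    for (prev_dt, prev_total), (next_dt, next_total) in zip(rows, rows[1:]):
        filled.append((prev_dt, prev_total))
        # Number of grid steps between two known rows (1 means no gap).
        steps = round((next_dt - prev_dt) / step)
        for i in range(1, steps):
            # prev_value + (next_value - prev_value) * ratio
            ratio = i / steps
            filled.append((prev_dt + i * step,
                           prev_total + (next_total - prev_total) * ratio))
    if rows:
        filled.append(rows[-1])
    return filled


if __name__ == "__main__":
    sample = [
        (datetime(2019, 7, 1, 11, 0), 161),
        (datetime(2019, 7, 1, 11, 15), 163),
        (datetime(2019, 7, 1, 12, 0), 170),
    ]
    for dt, total in fill_gaps(sample):
        print(dt, round(total, 3))
```

For the sample rows this inserts 11:30 and 11:45 with totals of 163 + 7/3 and 163 + 14/3. Note that round(..., 3) rounds the second value to 167.667, whereas the question's table shows it truncated to 167.666.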
