How to fill irregularly missing time-series values with linear interpolation for each user in BigQuery?

Problem description

I have data with irregularly missing time-series values for each user, and I'd like to resample it at a fixed interval with linear interpolation using BigQuery Standard SQL.

+------+---------------------+-------+
| name |        time         | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 |     3 |
| Jane | 2020-11-14 09:05:00 |     5 |
| Jane | 2020-11-14 09:07:00 |     1 |
| Jane | 2020-11-14 09:09:00 |     8 |
| Jane | 2020-11-14 09:10:00 |     4 |
| Kay  | 2020-11-14 09:01:00 |     7 |
| Kay  | 2020-11-14 09:04:00 |     1 |
| Kay  | 2020-11-14 09:05:00 |    10 |
| Kay  | 2020-11-14 09:09:00 |     6 |
| Kay  | 2020-11-14 09:10:00 |     7 |
+------+---------------------+-------+

I'd like to convert it as follows:

+------+---------------------+-------+-----------------+
| name |        time         | value |                 |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3     |                 |
| Jane | 2020-11-14 09:02:00 | 3.5   | <= interpolated |
| Jane | 2020-11-14 09:03:00 | 4     | <= interpolated |
| Jane | 2020-11-14 09:04:00 | 4.5   | <= interpolated |
| Jane | 2020-11-14 09:05:00 | 5     |                 |
| Jane | 2020-11-14 09:06:00 | 3     | <= interpolated |
| Jane | 2020-11-14 09:07:00 | 1     |                 |
| Jane | 2020-11-14 09:08:00 | 4.5   | <= interpolated |
| Jane | 2020-11-14 09:09:00 | 8     |                 |
| Jane | 2020-11-14 09:10:00 | 4     |                 |
| Kay  | 2020-11-14 09:01:00 | 7     |                 |
| Kay  | 2020-11-14 09:02:00 | 5     | <= interpolated |
| Kay  | 2020-11-14 09:03:00 | 3     | <= interpolated |
| Kay  | 2020-11-14 09:04:00 | 1     |                 |
| Kay  | 2020-11-14 09:05:00 | 10    |                 |
| Kay  | 2020-11-14 09:06:00 | 9     | <= interpolated |
| Kay  | 2020-11-14 09:07:00 | 8     | <= interpolated |
| Kay  | 2020-11-14 09:08:00 | 7     | <= interpolated |
| Kay  | 2020-11-14 09:09:00 | 6     |                 |
| Kay  | 2020-11-14 09:10:00 | 7     |                 |
+------+---------------------+-------+-----------------+

Is there a smart solution for this?

Supplement: This is an applied version of this Stack Overflow question. It is very similar, but differs in that this data is time-series data and has a name for each user.

Thank you.

Solution

Below is for BigQuery Standard SQL:

#standardSQL
select name, time,
    -- keep observed values; otherwise interpolate between the nearest readings on either side
    ifnull(value, start_value
      + (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
    ) as value_interpolated
from (
    select name, time, value,
        -- nearest reading at or before the current minute
        first_value(tick ignore nulls) over win1 as start_tick,
        first_value(value ignore nulls) over win1 as start_value,
        -- nearest reading at or after the current minute
        first_value(tick ignore nulls) over win2 as end_tick,
        first_value(value ignore nulls) over win2 as end_value
    from (
        -- one row per user per minute; tick and value are NULL where no reading exists
        select name, time, t.time as tick, value
        from (
            select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
            from `project.dataset.table`
            group by name
        )
        cross join unnest(times) time
        left join `project.dataset.table` t
        using(name, time)
    )
    window
        win1 as (partition by name order by time desc rows between current row and unbounded following),
        win2 as (partition by name order by time rows between current row and unbounded following)
)
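To see how the query builds its result, it may help to run the innermost subquery on its own. The sketch below does that against an inline copy of the sample data from the question. The `sample` CTE is an assumption introduced here for illustration and stands in for `project.dataset.table`; the rest mirrors the innermost part of the query above, with an `order by` added for readability.

```sql
#standardSQL
-- Hypothetical inline copy of the question's sample data,
-- standing in for `project.dataset.table`.
with sample as (
    select 'Jane' as name, timestamp '2020-11-14 09:01:00' as time, 3 as value union all
    select 'Jane', timestamp '2020-11-14 09:05:00', 5 union all
    select 'Jane', timestamp '2020-11-14 09:07:00', 1 union all
    select 'Jane', timestamp '2020-11-14 09:09:00', 8 union all
    select 'Jane', timestamp '2020-11-14 09:10:00', 4 union all
    select 'Kay',  timestamp '2020-11-14 09:01:00', 7 union all
    select 'Kay',  timestamp '2020-11-14 09:04:00', 1 union all
    select 'Kay',  timestamp '2020-11-14 09:05:00', 10 union all
    select 'Kay',  timestamp '2020-11-14 09:09:00', 6 union all
    select 'Kay',  timestamp '2020-11-14 09:10:00', 7
)
-- Build one row per user per minute between that user's first and last reading,
-- then attach the original reading where one exists.
select name, time, t.time as tick, value
from (
    select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
    from sample
    group by name
)
cross join unnest(times) time
left join sample t
using(name, time)
order by name, time
```

Minutes with no reading (for example Jane at 09:02) come back with NULL in both `tick` and `value`. Those NULLs are what `first_value(... ignore nulls)` skips over in the two windows when it looks backward and forward for the nearest readings.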

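Applied to the sample data from your question, the query returns one row per user per minute, with each missing value filled in by linear interpolation, matching the desired table above.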
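The interpolated values can be checked by hand against the expression inside `ifnull`. The symbols below are notation introduced here, not part of the original answer: the subscripts "start" and "end" refer to `start_tick`/`start_value` and `end_tick`/`end_value`, and time differences are measured in minutes.

```latex
\begin{align*}
v(t) &= v_{\text{start}}
      + \frac{v_{\text{end}} - v_{\text{start}}}{t_{\text{end}} - t_{\text{start}}}
        \,\bigl(t - t_{\text{start}}\bigr) \\[4pt]
\text{Jane, 09:02:}\quad v &= 3 + \frac{5 - 3}{4} \cdot 1 = 3.5 \\
\text{Kay, 09:06:}\quad v &= 10 + \frac{6 - 10}{4} \cdot 1 = 9
\end{align*}
```

For Jane at 09:02 the surrounding readings are 3 at 09:01 and 5 at 09:05. For Kay at 09:06 they are 10 at 09:05 and 6 at 09:09. Both results agree with the desired table in the question.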
