如何在BigQuery中使用每个用户的线性插值法来填充不规则丢失的时间序列值? [英] How to fill irregularly missing time-series values with linear interepolation by each user in BigQuery?

查看:114
本文介绍了如何在BigQuery中使用每个用户的线性插值法来填充不规则丢失的时间序列值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有数据,每个用户的 时间序列 值均不规则地 丢失了,我想使用BigQuery标准SQL使用线性插值以一定间隔转换它.

I have data which has missing time series values irregulaly for each users, and I'd like to convert it with a certain interval with liner interpolation using BigQuery Standard SQL.

+------+---------------------+-------+
| name |        time         | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 |     3 |
| Jane | 2020-11-14 09:05:00 |     5 |
| Jane | 2020-11-14 09:07:00 |     1 |
| Jane | 2020-11-14 09:09:00 |     8 |
| Jane | 2020-11-14 09:10:00 |     4 |
| Kay  | 2020-11-14 09:01:00 |     7 |
| Kay  | 2020-11-14 09:04:00 |     1 |
| Kay  | 2020-11-14 09:05:00 |    10 |
| Kay  | 2020-11-14 09:09:00 |     6 |
| Kay  | 2020-11-14 09:10:00 |     7 |
+------+---------------------+-------+

,我想将其转换如下:

+------+---------------------+-------+-----------------+
| name |        time         | value |                 |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3     |                 |
| Jane | 2020-11-14 09:02:00 | 3.5   | <= interpolaetd |
| Jane | 2020-11-14 09:03:00 | 4     | <= interpolaetd |
| Jane | 2020-11-14 09:04:00 | 4.5   | <= interpolaetd |
| Jane | 2020-11-14 09:05:00 | 5     |                 |
| Jane | 2020-11-14 09:06:00 | 3     | <= interpolaetd |
| Jane | 2020-11-14 09:07:00 | 1     |                 |
| Jane | 2020-11-14 09:08:00 | 4.5   | <= interpolaetd |
| Jane | 2020-11-14 09:09:00 | 8     |                 |
| Jane | 2020-11-14 09:10:00 | 4     |                 |
| Kay  | 2020-11-14 09:01:00 | 7     |                 |
| Kay  | 2020-11-14 09:02:00 | 5     | <= interpolaetd |
| Kay  | 2020-11-14 09:03:00 | 3     | <= interpolaetd |
| Kay  | 2020-11-14 09:04:00 | 1     |                 |
| Kay  | 2020-11-14 09:05:00 | 10    |                 |
| Kay  | 2020-11-14 09:06:00 | 9     | <= interpolaetd |
| Kay  | 2020-11-14 09:07:00 | 8     | <= interpolaetd |
| Kay  | 2020-11-14 09:08:00 | 7     | <= interpolaetd |
| Kay  | 2020-11-14 09:09:00 | 6     |                 |
| Kay  | 2020-11-14 09:10:00 | 7     |                 |
+------+---------------------+-------+-----------------+

我可以问你一些聪明的解决方案吗?

Can I ask you some smart solution for this?

补充:这是这个stackoverflow问题.它非常相似,但不同之处在于该数据是 时间序列 数据,并且具有 每个用户 的名称.

Suppliment: This is an aplication problem for this stackoverflow question. It's very similar but different in that this data is time seris data and it has names for each users.

谢谢.

推荐答案

以下是BigQuery SQL

Below is for BigQuery SQL

#standardSQL
select name, time,
    ifnull(value, start_value 
      + (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
    ) as value_interpolated
from (
    select name, time, value,
    first_value(tick ignore nulls ) over win1 as start_tick,
    first_value(value ignore nulls) over win1 as start_value,
    first_value(tick ignore nulls ) over win2 as end_tick,
    first_value(value ignore nulls) over win2 as end_value,
    from (
        select name, time, t.time as tick, value
        from (
            select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
            from `project.dataset.table`
            group by name
        )
        cross join unnest(times) time 
        left join `project.dataset.table` t 
        using(name, time)
    )
    window 
        win1 as (partition by name order by time desc rows between current row and unbounded following),
        win2 as (partition by name order by time rows between current row and unbounded following)
)     

如果要应用于您的问题中的示例数据-输出为

if to apply to sample data from your question - output is

这篇关于如何在BigQuery中使用每个用户的线性插值法来填充不规则丢失的时间序列值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆