Efficient way to fill time series per group


Question

I am looking for a way to fill a time-series data set by time, per group. The very inefficient approach I was using was to split the data set by group and apply a custom time-series fill function (create a sequence between the min and max, then merge) to every element of the resulting list. Needless to say, the operation never got past the splitting step.

My data set looks like this:


    source                 grp cnt
 1:     83 2017-06-06 13:00:00   1
 2:     83 2017-06-06 23:00:00   1
 3:     83 2017-06-07 03:00:00   1
 4:     83 2017-06-07 07:00:00   2
 5:     83 2017-06-07 13:00:00   1
 6:     83 2017-06-07 19:00:00   1
 7:     83 2017-06-08 00:00:00   1
 8:     83 2017-06-08 14:00:00   1
 9:     83 2017-06-08 15:00:00   1
10:     83 2017-06-08 20:00:00   1
11:    137 2017-06-04 02:00:00   1
12:    137 2017-06-04 05:00:00   1
13:    137 2017-06-04 23:00:00   1
...


My attempt was to use tidyverse methods, relying on the complete function, i.e.

library(tidyverse)

d1 %>% 
 group_by(source) %>% 
 complete(source, grp = seq(min(grp), max(grp), by = 'hour'))

However, after about 40-45 seconds a progress bar appeared (apparently a neat feature of some tidyverse functions; I suspect complete in this case), estimating 9 hours to completion. My data set is very large and this is not the lightest of operations, so I am looking for something really efficient.

Data

#dput(d1)
structure(list(source = c("83", "83", "83", "83", "83", "83", 
"83", "83", "83", "83", "137", "137", "137", "137", "137", "137", 
"137", "137", "137", "137", "137", "137", "137", "137"), grp = structure(c(1496743200, 
1496779200, 1496793600, 1496808000, 1496829600, 1496851200, 1496869200, 
1496919600, 1496923200, 1496941200, 1496530800, 1496541600, 1496606400, 
1496617200, 1496649600, 1496696400, 1496808000, 1496844000, 1496876400, 
1496962800, 1497880800, 1497888000, 1497978000, 1497996000), class = c("POSIXct", 
"POSIXt"), tzone = ""), cnt = c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), .Names = c("source", "grp", "cnt"), row.names = c(NA, -24L
), class = "data.frame")


Recommended answer

It appears that data.table really is much faster than the tidyverse option. Merely translating the above into data.table (courtesy of @Frank) completed the operation in a little under 3 minutes.

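library(data.table)

# Build the full hourly grid per source (setDT converts d1 to a data.table by reference)
mDT = setDT(d1)[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]
# Join d1 onto the grid; hours that are absent from d1 get NA in cnt
new_D <- d1[mDT, on = names(mDT)]

# If needed, replace the missing counts with 0
new_D <- new_D[, cnt := replace(cnt, is.na(cnt), 0)]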
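To see what the join produces, here is a minimal self-contained sketch of the same idea on a tiny made-up table. The names toy, grid and filled and all the values are hypothetical and only serve the illustration; the timestamps are pinned to UTC so the result does not depend on the local time zone.

```r
library(data.table)

# Toy data (hypothetical values): two sources with gaps in their hourly series
toy <- data.table(
  source = c("a", "a", "b", "b"),
  grp = as.POSIXct(c("2017-06-06 13:00:00", "2017-06-06 16:00:00",
                     "2017-06-07 01:00:00", "2017-06-07 03:00:00"), tz = "UTC"),
  cnt = c(1L, 2L, 1L, 1L)
)

# One row per source and hour, from that source's first to last timestamp
grid <- toy[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]

# Join toy onto the grid: hours missing from toy come back with NA in cnt
filled <- toy[grid, on = .(source, grp)]

# Replace the NAs with 0 by reference
filled[is.na(cnt), cnt := 0L]

print(filled)
```

Because the grid is built per group with by = source, each source is only padded between its own first and last timestamp, not across the overall range of the data.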
