Efficient way to Fill Time-Series per group
Question
I was looking for a way to fill a time series data set by time, per group. The very inefficient way I was using was to split the data set per group and apply a custom time-series fill function (create a sequence between min and max, and merge) to every element of that list. Needless to say, this operation would not get past the splitting step.
My data set looks like
source grp cnt
1: 83 2017-06-06 13:00:00 1
2: 83 2017-06-06 23:00:00 1
3: 83 2017-06-07 03:00:00 1
4: 83 2017-06-07 07:00:00 2
5: 83 2017-06-07 13:00:00 1
6: 83 2017-06-07 19:00:00 1
7: 83 2017-06-08 00:00:00 1
8: 83 2017-06-08 14:00:00 1
9: 83 2017-06-08 15:00:00 1
10: 83 2017-06-08 20:00:00 1
11: 137 2017-06-04 02:00:00 1
12: 137 2017-06-04 05:00:00 1
13: 137 2017-06-04 23:00:00 1
...
My attempt was to use tidyverse methods by utilising the complete function, i.e.
library(tidyverse)
d1 %>%
group_by(source) %>%
complete(source, grp = seq(min(grp), max(grp), by = 'hour'))
However, after about 40-45 seconds, a progress bar appeared (apparently a neat feature in some tidyverse functions; I suspect complete in this case), which estimated 9 hours to completion. My dataset is very big and this is not the lightest operation, so I am looking for something really efficient.
Data
#dput(d1)
structure(list(source = c("83", "83", "83", "83", "83", "83",
"83", "83", "83", "83", "137", "137", "137", "137", "137", "137",
"137", "137", "137", "137", "137", "137", "137", "137"), grp = structure(c(1496743200,
1496779200, 1496793600, 1496808000, 1496829600, 1496851200, 1496869200,
1496919600, 1496923200, 1496941200, 1496530800, 1496541600, 1496606400,
1496617200, 1496649600, 1496696400, 1496808000, 1496844000, 1496876400,
1496962800, 1497880800, 1497888000, 1497978000, 1497996000), class = c("POSIXct",
"POSIXt"), tzone = ""), cnt = c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), .Names = c("source", "grp", "cnt"), row.names = c(NA, -24L
), class = "data.frame")
Recommended answer
It appears that data.table is much faster than the tidyverse option. Merely translating the above into data.table (courtesy of @Frank) completed the operation in a little under 3 minutes.
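library(data.table)
# Build the full hourly grid per source, from each group's first to last timestamp
mDT = setDT(d1)[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]
# Join the original rows onto the grid; hours with no observation get cnt = NA
new_D <- d1[mDT, on = names(mDT)]
new_D <- new_D[, cnt := replace(cnt, is.na(cnt), 0)] #If needed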
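To make the grid-then-join idea easier to check by eye, here is a minimal sketch on a tiny made-up table. The names toy, grid and filled and all values in them are hypothetical, and the timestamps are pinned to UTC only so the result does not depend on the local time zone. It follows the same two steps as the answer above, but fills the gaps with 0L so that cnt stays an integer column.

```r
library(data.table)

# Toy data: hypothetical values, not taken from the question's data set
toy <- data.table(
  source = c("a", "a", "b", "b"),
  grp = as.POSIXct(c("2017-06-06 13:00:00", "2017-06-06 16:00:00",
                     "2017-06-07 01:00:00", "2017-06-07 02:00:00"), tz = "UTC"),
  cnt = c(1L, 2L, 1L, 1L)
)

# Step 1: one row per hour between each source's first and last timestamp
grid <- toy[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]

# Step 2: join the observed rows onto the grid; every grid row is kept,
# and cnt is NA wherever there was no observation for that hour
filled <- toy[grid, on = .(source, grp)]

# Optional: treat missing hours as zero counts (0L keeps the column integer)
filled[is.na(cnt), cnt := 0L]

print(filled)
```

Source "a" spans 13:00 to 16:00, so it gains two zero-count rows for 14:00 and 15:00, while source "b" already has consecutive hours and is unchanged.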