重叠连接以计算间隔内的平均值? [英] overlap join to calculate averages within intervals?
问题描述
我经常需要在给定时间间隔(事件)内平均时间序列数据,基本上是这里。
如答案所示,我使用一个SQL语句对我的数据'长'格式。这里是一个例子:
#create伪数据框架
set.seed(1)
data< ; - data.frame(
date = seq(from = as.POSIXct(2014-01-01 00:00),
to = as.POSIXct(2014-01-31 23: 00),
by = 300),
A = runif(8917),
B = runif(8917),
C = runif(8917),
D = runif(8917)
)
#转换为长格式
require(dplyr)
data< - data%>%
(类,值,A:D)
#创建虚拟事件
事件< - data.frame(
id = c(blue,red green,yellow),
start = as.POSIXct(c(2014-01-03 13:00,
2014-01-12 08:00,
2014-01-18 10:00,
2014-01-27 23:00)),
stop = as.POSIXct(c(2014-01-03 19:00 ,
2014-01-13 17:00,
2014-01-20 10:00,
2014-01-28 20:00))
)
事件中的#average值,按类分组
require(sqldf)
results< - sqldf(
SELECT x.id ,y.class,avg(y.value)AS意味着
FROM事件为x,数据为y
WHERE y.date在x.start和x.stop之间
GROUP BY x.id ,y.class
)
$ b
id class mean
1 blue A 0.4879129
2 blue B 0.4945888
3 blue C 0.5312504
4蓝色D 0.4968260
5绿色A 0.5235671
6绿色B 0.5030602
7绿色C 0.5071219
8绿色D 0.5002010
9红色A 0.5122966
10红色B 0.4767966
11红色C 0.5032387
12红色D 0.5018389
13黄色A 0.4727868
14黄色B 0.4626688
15黄色C 0.4930207
16黄色D 0.5184966 $但是,由于我的真实数据是巨大的(长格式可以达到几百万行),SQL操作需要相当一些时间。
有更有效的方法来做这个操作吗?我碰到了 data.table :: foverlaps
,这被称为重叠连接,但我不完全明白,如果这是我需要的。 p>
如果有一个有效的方法来添加en'event'列到数据,指示每一行(日期),它落入哪个事件,然后我可以做一个分组摘要与dplyr比较SQL语句。但我不知道该怎么做...
任何专家的建议都不胜感激。
更新
正如注释中建议的,我已经添加了一个索引到我的SQL语句。不幸的是,这并没有加速我的一个巨大的现实世界的问题。计算还需要〜40分钟。
然后我复制粘贴data.table解决方案,由David提供,并印象深刻,看到它运行在完全相同的真实世界数据集上少于1秒。
我还是不明白为什么它做什么,但我花了一些时间学习data.table语法的动机肯定增加了很多。再次感谢!
解决方案这里有一个可能的 data.table :: foverlaps
p>
library(data.table)
setDT(data)[,`:=`(start = date,stop = date )]
setkey(setDT(events),start,stop)
foverlaps(data,events,nomatch = 0L)[,(Mean = mean(value)),keyby =。 )]
#id class Mean
#1:blue A 0.4879129
#2:blue B 0.4945888
#3:blue C 0.5312504
#4:blue D 0.4968260
#5:green A 0.5235671
#6:green B 0.5030602
#7:green C 0.5071219
#8:green D 0.5002010
#9:red A 0.5122966
#10:red B 0.4767966
#11:red C 0.5032387
#12:red D 0.5018389
#13:yellow A 0.4727868
#14:yellow B 0.4626688
#15:yellow C 0.4930207
#16:yellow D 0.5184966
$
$ b / code>与数据
中的列重叠。
键 c> 事件
foverlaps
并删除不匹配的时间间隔( nomatch = 0L
)。
code>
由 id
和
I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.
As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:
#create dummy data frame
set.seed(1)
data <- data.frame(
date = seq(from = as.POSIXct("2014-01-01 00:00"),
to = as.POSIXct("2014-01-31 23:00"),
by = 300),
A = runif(8917),
B = runif(8917),
C = runif(8917),
D = runif(8917)
)
#convert to long format
require(dplyr)
data <- data %>%
gather(class,value,A:D)
# create dummy events
events <- data.frame(
id = c("blue","red","green","yellow"),
start = as.POSIXct(c("2014-01-03 13:00",
"2014-01-12 08:00",
"2014-01-18 10:00",
"2014-01-27 23:00")),
stop = as.POSIXct(c("2014-01-03 19:00",
"2014-01-13 17:00",
"2014-01-20 10:00",
"2014-01-28 20:00"))
)
#average value within events, grouped by class
require(sqldf)
results <- sqldf("
SELECT x.id, y.class, avg(y.value) AS mean
FROM events as x, data as y
WHERE y.date between x.start and x.stop
GROUP BY x.id, y.class
")
which gives the desired output
id class mean
1 blue A 0.4879129
2 blue B 0.4945888
3 blue C 0.5312504
4 blue D 0.4968260
5 green A 0.5235671
6 green B 0.5030602
7 green C 0.5071219
8 green D 0.5002010
9 red A 0.5122966
10 red B 0.4767966
11 red C 0.5032387
12 red D 0.5018389
13 yellow A 0.4727868
14 yellow B 0.4626688
15 yellow C 0.4930207
16 yellow D 0.5184966
However, as my real data is huge (long format can reach several million rows), the SQL operation needs quite some time.
Are there more efficient ways to do this operation? I've stumpled across data.table::foverlaps
, which is called an 'overlap join', but I don't fully understand if this is what I need.
If there was an efficient way to add en 'event' column to the data, indicating for each row (date), which event it falls into, then I could do a grouped summary with dplyr compare to the SQL statement. But I don't know how to do this...
Any suggestions from the experts would be appreciated.
Update
As suggested in the comments, I have added the creation of an index to my SQL statement. Unfortunately this did not speed up things for one of my large real world problems. The calculation still took ~40 mins to run.
Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.
I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!
解决方案 Here's a possible data.table::foverlaps
solution
library(data.table)
setDT(data)[, `:=`(start = date, stop = date)]
setkey(setDT(events), start, stop)
foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
# id class Mean
# 1: blue A 0.4879129
# 2: blue B 0.4945888
# 3: blue C 0.5312504
# 4: blue D 0.4968260
# 5: green A 0.5235671
# 6: green B 0.5030602
# 7: green C 0.5071219
# 8: green D 0.5002010
# 9: red A 0.5122966
# 10: red B 0.4767966
# 11: red C 0.5032387
# 12: red D 0.5018389
# 13: yellow A 0.4727868
# 14: yellow B 0.4626688
# 15: yellow C 0.4930207
# 16: yellow D 0.5184966
The logic seems pretty straight forward to me.
- Set
start
and stop
columns within data
to overlap against.
key
the events
data set by the same columns.
- Run
foverlaps
and remove unmatched intervals (nomatch = 0L
).
- Calculate
mean(value)
by id
and class
这篇关于重叠连接以计算间隔内的平均值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!