重叠连接以计算间隔内的平均值？ [英] overlap join to calculate averages within intervals?

查看：113 发布时间：2017/3/12 12:02:08 r data.table dplyr

本文介绍了重叠连接以计算间隔内的平均值？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我经常需要在给定时间间隔（事件）内平均时间序列数据，基本上是这里。

如答案所示，我使用一个SQL语句对我的数据'长'格式。这里是一个例子：

  #create伪数据框架
 set.seed（1）
 data< ;  -  data.frame（
 date = seq（from = as.POSIXct（2014-01-01 00:00），
 to = as.POSIXct（2014-01-31 23： 00），
 by = 300），
 A = runif（8917），
 B = runif（8917），
 C = runif（8917），
 D = runif（8917）
）
 
＃转换为长格式
 require（dplyr）
 data<  -  data％>％
 （类，值，A：D）
 
＃创建虚拟事件
事件<  -  data.frame（
 id = c（blue，red green，yellow），
 start = as.POSIXct（c（2014-01-03 13:00，
2014-01-12 08:00，
 2014-01-18 10:00，
2014-01-27 23:00）），
 stop = as.POSIXct（c（2014-01-03 19:00 ，
2014-01-13 17:00，
2014-01-20 10:00，
2014-01-28 20:00））
）
 
 
事件中的#average值，按类分组
 require（sqldf）
 results<  -  sqldf（
 SELECT x.id ，y.class，avg（y.value）AS意味着
 FROM事件为x，数据为y 
 WHERE y.date在x.start和x.stop之间
 GROUP BY x.id ，y.class 
）

$ b

  id class mean 
 1 blue A 0.4879129 
 2 blue B 0.4945888 
 3 blue C 0.5312504 
 4蓝色D 0.4968260 
 5绿色A 0.5235671 
 6绿色B 0.5030602 
 7绿色C 0.5071219 
 8绿色D 0.5002010 
 9红色A 0.5122966 
 10红色B 0.4767966 
 11红色C 0.5032387 
 12红色D 0.5018389 
 13黄色A 0.4727868 
 14黄色B 0.4626688 
 15黄色C 0.4930207 
 16黄色D 0.5184966 $但是，由于我的真实数据是巨大的（长格式可以达到几百万行），SQL操作需要相当一些时间。
 
 
 有更有效的方法来做这个操作吗？我碰到了 data.table :: foverlaps ，这被称为重叠连接，但我不完全明白，如果这是我需要的。 p> 
 
 
如果有一个有效的方法来添加en'event'列到数据，指示每一行（日期），它落入哪个事件，然后我可以做一个分组摘要与dplyr比较SQL语句。但我不知道该怎么做... 
 
 
 任何专家的建议都不胜感激。
 
 
  更新 
 
 
 正如注释中建议的，我已经添加了一个索引到我的SQL语句。不幸的是，这并没有加速我的一个巨大的现实世界的问题。计算还需要〜40分钟。
 
 
 然后我复制粘贴data.table解决方案，由David提供，并印象深刻，看到它运行在完全相同的真实世界数据集上少于1秒。 
 
 
 我还是不明白为什么它做什么，但我花了一些时间学习data.table语法的动机肯定增加了很多。再次感谢！ 
解决方案
这里有一个可能的 data.table :: foverlaps  p> 
 
 
  library（data.table）
 setDT（data）[，`：=`（start = date，stop = date ）] 
 setkey（setDT（events），start，stop）
 foverlaps（data，events，nomatch = 0L）[，（Mean = mean（value）），keyby =。 ）] 
＃id class Mean 
＃1：blue A 0.4879129 
＃2：blue B 0.4945888 
＃3：blue C 0.5312504 
＃4：blue D 0.4968260 
＃5：green A 0.5235671 
＃6：green B 0.5030602 
＃7：green C 0.5071219 
＃8：green D 0.5002010 
＃9：red A 0.5122966 
＃10：red B 0.4767966 
＃11：red C 0.5032387 
＃12：red D 0.5018389 
＃13：yellow A 0.4727868 
＃14：yellow B 0.4626688 
＃15：yellow C 0.4930207 
＃16：yellow D 0.5184966 
  
 
 
 
 $ 
 
 $ b           / code>与数据中的列重叠。 
  键  c> 事件  foverlaps 并删除不匹配的时间间隔（ nomatch = 0L ）。 
  code> 由 id 和 
  
I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.

As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:
#create dummy data frame
set.seed(1)
data <- data.frame(
  date = seq(from = as.POSIXct("2014-01-01 00:00"),
             to = as.POSIXct("2014-01-31 23:00"),
             by = 300),
  A = runif(8917),
  B = runif(8917),
  C = runif(8917),
  D = runif(8917)
)

#convert to long format
require(dplyr)
data <- data %>%
  gather(class,value,A:D)

# create dummy events
events <- data.frame(
  id = c("blue","red","green","yellow"),
  start = as.POSIXct(c("2014-01-03 13:00",
                       "2014-01-12 08:00",
                       "2014-01-18 10:00",
                       "2014-01-27 23:00")),
  stop = as.POSIXct(c("2014-01-03 19:00",
                       "2014-01-13 17:00",
                       "2014-01-20 10:00",
                       "2014-01-28 20:00"))
)


#average value within events, grouped by class
require(sqldf)
results <- sqldf("
     SELECT x.id, y.class, avg(y.value) AS mean 
     FROM events as x, data as y 
     WHERE y.date between x.start and x.stop 
     GROUP BY x.id, y.class
")
which gives the desired output
       id class      mean
1    blue     A 0.4879129
2    blue     B 0.4945888
3    blue     C 0.5312504
4    blue     D 0.4968260
5   green     A 0.5235671
6   green     B 0.5030602
7   green     C 0.5071219
8   green     D 0.5002010
9     red     A 0.5122966
10    red     B 0.4767966
11    red     C 0.5032387
12    red     D 0.5018389
13 yellow     A 0.4727868
14 yellow     B 0.4626688
15 yellow     C 0.4930207
16 yellow     D 0.5184966
However, as my real data is huge (long format can reach several million rows), the SQL operation needs quite some time.

Are there more efficient ways to do this operation? I've stumpled across data.table::foverlaps, which is called an 'overlap join', but I don't fully understand if this is what I need.

If there was an efficient way to add en 'event' column to the data, indicating for each row (date), which event it falls into, then I could do a grouped summary with dplyr compare to the SQL statement. But I don't know how to do this...

Any suggestions from the experts would be appreciated.

Update

As suggested in the comments, I have added the creation of an index to my SQL statement. Unfortunately this did not speed up things for one of my large real world problems. The calculation still took ~40 mins to run.

Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset. 

I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that! 
 解决方案 
Here's a possible data.table::foverlaps solution
library(data.table)
setDT(data)[, `:=`(start = date, stop = date)]
setkey(setDT(events), start, stop)
foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
#         id class      Mean
#  1:   blue     A 0.4879129
#  2:   blue     B 0.4945888
#  3:   blue     C 0.5312504
#  4:   blue     D 0.4968260
#  5:  green     A 0.5235671
#  6:  green     B 0.5030602
#  7:  green     C 0.5071219
#  8:  green     D 0.5002010
#  9:    red     A 0.5122966
# 10:    red     B 0.4767966
# 11:    red     C 0.5032387
# 12:    red     D 0.5018389
# 13: yellow     A 0.4727868
# 14: yellow     B 0.4626688
# 15: yellow     C 0.4930207
# 16: yellow     D 0.5184966




The logic seems pretty straight forward to me.

Set start and stop columns within data to overlap against.
key the events data set by the same columns.
Run foverlaps and remove unmatched intervals (nomatch = 0L).
Calculate mean(value) by id and class 


                        
这篇关于重叠连接以计算间隔内的平均值？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

重叠连接以计算间隔内的平均值？ [英] overlap join to calculate averages within intervals?

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

重叠连接以计算间隔内的平均值？ [英] overlap join to calculate averages within intervals?

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭