重叠连接以计算间隔内的平均值? [英] overlap join to calculate averages within intervals?

查看:113
本文介绍了重叠连接以计算间隔内的平均值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我经常需要在给定时间间隔(事件)内平均时间序列数据,基本上是这里



如答案所示,我使用一个SQL语句对我的数据'长'格式。这里是一个例子:

  #create伪数据框架
set.seed(1)
data< ; - data.frame(
date = seq(from = as.POSIXct(2014-01-01 00:00),
to = as.POSIXct(2014-01-31 23: 00),
by = 300),
A = runif(8917),
B = runif(8917),
C = runif(8917),
D = runif(8917)


#转换为长格式
require(dplyr)
data< - data%>%
(类,值,A:D)

#创建虚拟事件
事件< - data.frame(
id = c(blue,red green,yellow),
start = as.POSIXct(c(2014-01-03 13:00,
2014-01-12 08:00,
2014-01-18 10:00,
2014-01-27 23:00)),
stop = as.POSIXct(c(2014-01-03 19:00 ,
2014-01-13 17:00,
2014-01-20 10:00,
2014-01-28 20:00))



事件中的#average值,按类分组
require(sqldf)
results< - sqldf(
SELECT x.id ,y.class,avg(y.value)AS意味着
FROM事件为x,数据为y
WHERE y.date在x.start和x.stop之间
GROUP BY x.id ,y.class


$ b

  id class mean 
1 blue A 0.4879129
2 blue B 0.4945888
3 blue C 0.5312504
4蓝色D 0.4968260
5绿色A 0.5235671
6绿色B 0.5030602
7绿色C 0.5071219
8绿色D 0.5002010
9红色A 0.5122966
10红色B 0.4767966
11红色C 0.5032387
12红色D 0.5018389
13黄色A 0.4727868
14黄色B 0.4626688
15黄色C 0.4930207
16黄色D 0.5184966 $但是,由于我的真实数据是巨大的(长格式可以达到几百万行),SQL操作需要相当一些时间。



有更有效的方法来做这个操作吗?我碰到了 data.table :: foverlaps ,这被称为重叠连接,但我不完全明白,如果这是我需要的。 p>

如果有一个有效的方法来添加en'event'列到数据,指示每一行(日期),它落入哪个事件,然后我可以做一个分组摘要与dplyr比较SQL语句。但我不知道该怎么做...



任何专家的建议都不胜感激。



更新



正如注释中建议的,我已经添加了一个索引到我的SQL语句。不幸的是,这并没有加速我的一个巨大的现实世界的问题。计算还需要〜40分钟。



然后我复制粘贴data.table解决方案,由David提供,并印象深刻,看到它运行在完全相同的真实世界数据集上少于1秒。



我还是不明白为什么它做什么,但我花了一些时间学习data.table语法的动机肯定增加了很多。再次感谢!

解决方案

这里有一个可能的 data.table :: foverlaps p>

  library(data.table)
setDT(data)[,`:=`(start = date,stop = date )]
setkey(setDT(events),start,stop)
foverlaps(data,events,nomatch = 0L)[,(Mean = mean(value)),keyby =。 )]
#id class Mean
#1:blue A 0.4879129
#2:blue B 0.4945888
#3:blue C 0.5312504
#4:blue D 0.4968260
#5:green A 0.5235671
#6:green B 0.5030602
#7:green C 0.5071219
#8:green D 0.5002010
#9:red A 0.5122966
#10:red B 0.4767966
#11:red C 0.5032387
#12:red D 0.5018389
#13:yellow A 0.4727868
#14:yellow B 0.4626688
#15:yellow C 0.4930207
#16:yellow D 0.5184966




$


$ b / code>与数据中的列重叠。
键 c> 事件 foverlaps 并删除不匹配的时间间隔( nomatch = 0L )。
  • code> id


    I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.

    As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:

    #create dummy data frame
    set.seed(1)
    data <- data.frame(
      date = seq(from = as.POSIXct("2014-01-01 00:00"),
                 to = as.POSIXct("2014-01-31 23:00"),
                 by = 300),
      A = runif(8917),
      B = runif(8917),
      C = runif(8917),
      D = runif(8917)
    )
    
    #convert to long format
    require(dplyr)
    data <- data %>%
      gather(class,value,A:D)
    
    # create dummy events
    events <- data.frame(
      id = c("blue","red","green","yellow"),
      start = as.POSIXct(c("2014-01-03 13:00",
                           "2014-01-12 08:00",
                           "2014-01-18 10:00",
                           "2014-01-27 23:00")),
      stop = as.POSIXct(c("2014-01-03 19:00",
                           "2014-01-13 17:00",
                           "2014-01-20 10:00",
                           "2014-01-28 20:00"))
    )
    
    
    #average value within events, grouped by class
    require(sqldf)
    results <- sqldf("
         SELECT x.id, y.class, avg(y.value) AS mean 
         FROM events as x, data as y 
         WHERE y.date between x.start and x.stop 
         GROUP BY x.id, y.class
    ")
    

    which gives the desired output

           id class      mean
    1    blue     A 0.4879129
    2    blue     B 0.4945888
    3    blue     C 0.5312504
    4    blue     D 0.4968260
    5   green     A 0.5235671
    6   green     B 0.5030602
    7   green     C 0.5071219
    8   green     D 0.5002010
    9     red     A 0.5122966
    10    red     B 0.4767966
    11    red     C 0.5032387
    12    red     D 0.5018389
    13 yellow     A 0.4727868
    14 yellow     B 0.4626688
    15 yellow     C 0.4930207
    16 yellow     D 0.5184966
    

    However, as my real data is huge (long format can reach several million rows), the SQL operation needs quite some time.

    Are there more efficient ways to do this operation? I've stumpled across data.table::foverlaps, which is called an 'overlap join', but I don't fully understand if this is what I need.

    If there was an efficient way to add en 'event' column to the data, indicating for each row (date), which event it falls into, then I could do a grouped summary with dplyr compare to the SQL statement. But I don't know how to do this...

    Any suggestions from the experts would be appreciated.

    Update

    As suggested in the comments, I have added the creation of an index to my SQL statement. Unfortunately this did not speed up things for one of my large real world problems. The calculation still took ~40 mins to run.

    Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.

    I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!

    解决方案

    Here's a possible data.table::foverlaps solution

    library(data.table)
    setDT(data)[, `:=`(start = date, stop = date)]
    setkey(setDT(events), start, stop)
    foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
    #         id class      Mean
    #  1:   blue     A 0.4879129
    #  2:   blue     B 0.4945888
    #  3:   blue     C 0.5312504
    #  4:   blue     D 0.4968260
    #  5:  green     A 0.5235671
    #  6:  green     B 0.5030602
    #  7:  green     C 0.5071219
    #  8:  green     D 0.5002010
    #  9:    red     A 0.5122966
    # 10:    red     B 0.4767966
    # 11:    red     C 0.5032387
    # 12:    red     D 0.5018389
    # 13: yellow     A 0.4727868
    # 14: yellow     B 0.4626688
    # 15: yellow     C 0.4930207
    # 16: yellow     D 0.5184966
    


    The logic seems pretty straight forward to me.

    1. Set start and stop columns within data to overlap against.
    2. key the events data set by the same columns.
    3. Run foverlaps and remove unmatched intervals (nomatch = 0L).
    4. Calculate mean(value) by id and class

    这篇关于重叠连接以计算间隔内的平均值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆