Grouping every n minutes with dplyr

Question

I have a dataset containing 10 events occurring at certain times on a given day, with a corresponding value for each event:

d1 <- data.frame(date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30", 
                            "21/05/2010 09:21:00", "21/05/2010 22:29:50", "21/05/2010 11:27:34", 
                            "21/05/2010 18:25:14", "21/05/2010 15:16:01", "21/05/2010 09:41:53", 
                            "21/05/2010 15:01:29", "21/05/2010 09:02:06"), format ="%d/%m/%Y %H:%M:%S"),
                 value = c(11313,42423,64645,643426,1313313,1313,3535,6476,11313,9875))

I want to aggregate the results every 3 minutes, in a standard dataframe format (from "21/05/2010 00:00:00" to "21/05/2010 23:57:00", so that the dataframe has 480 bins of 3 minutes each).

First, I create a dataframe containing bins of 3 minutes each:

d2 <- data.frame(date = seq(as.POSIXct("2010-05-21 00:00:00"), 
                            by="3 min", length.out=(1440/3)))

Then, I merge the two dataframes together and replace the resulting NAs with 0:

library(dplyr)
m <- merge(d1, d2, all=TRUE) %>% mutate(value = ifelse(is.na(value),0,value))

Finally, I use period.apply() from the xts package to sum the values for each bin:

library(xts)
a <- period.apply(m$value, endpoints(m$date, "minutes", 3), sum)

Is there a more efficient way to do this? It does not feel optimal.

Update #1

I adjusted my code after Joshua's answer:

library(xts)
startpoints <- function (x, on = "months", k = 1) { 
  head(endpoints(x, on, k) + 1, -1) 
}

m <- seq(as.POSIXct("2010-05-21 00:00:00"), by="3 min", length.out=1440/3)
x <- merge(value=xts(d1$value, d1$date), xts(,m))
y <- period.apply(x, c(0,startpoints(x, "minutes", 3)), sum, na.rm=TRUE)

I wasn't aware that na.rm=TRUE could be used with period.apply(), which now allows me to skip mutate(value = ifelse(is.na(value),0,value)). It's a step forward, and I'm actually pleased with the xts approach here, but I would like to know whether there is a pure dplyr solution I could use in such a situation.

Update #2

After trying Khashaa's answer, I ran into a problem because my time zone was not specified. This is what I got:

> tail(d4)
               interval sumvalue
476 2010-05-21 23:45:00       NA
477 2010-05-21 23:48:00       NA
478 2010-05-21 23:51:00       NA
479 2010-05-21 23:54:00       NA
480 2010-05-21 23:57:00    11313
481 2010-05-22 02:27:00   643426
> d4[450,]
               interval sumvalue
450 2010-05-21 22:27:00       NA

Now, after Sys.setenv(TZ="UTC"), it all works fine.
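
Setting TZ for the whole session works, but the same mismatch can probably also be avoided by passing a time zone explicitly wherever times are parsed or generated. A minimal sketch, assuming UTC is the intended zone for both the events and the bins; the names `events` and `bins` are illustrative and not part of the original code:

```r
# Parse the event times and build the bins in the same, explicit time zone,
# so the result does not depend on the session's TZ setting.
events <- as.POSIXct(c("21/05/2010 22:29:50", "21/05/2010 09:02:06"),
                     format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
bins <- seq(as.POSIXct("2010-05-21 00:00:00", tz = "UTC"),
            by = "3 min", length.out = 1440 / 3)

# Both vectors carry the same time zone attribute
attr(events, "tzone")
attr(bins, "tzone")
```

With both sides in the same zone, a merge or join on the bin start times should line up without touching the environment.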

Recommended answer

A lubridate-dplyr-esque solution.

library(lubridate)
library(dplyr)
d2 <- data.frame(interval = seq(ymd_hms('2010-05-21 00:00:00'), by = '3 min',length.out=(1440/3)))
d3 <- d1 %>% 
  mutate(interval = floor_date(date, unit="hour")+minutes(floor(minute(date)/3)*3)) %>% 
  group_by(interval) %>% 
  summarise(sumvalue = sum(value))  # one row per bin; mutate() would keep one row per event
d4 <- merge(d2,d3, all=TRUE) # better if left_join is used
tail(d4)
#               interval sumvalue
#475 2010-05-21 23:42:00       NA
#476 2010-05-21 23:45:00       NA
#477 2010-05-21 23:48:00       NA
#478 2010-05-21 23:51:00       NA
#479 2010-05-21 23:54:00       NA
#480 2010-05-21 23:57:00       NA
d4[450,]
#               interval sumvalue
#450 2010-05-21 22:27:00   643426
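
More recent versions of lubridate accept multi-unit periods in floor_date(), which would shorten the binning step above. A sketch, assuming lubridate 1.6.0 or later and times parsed as UTC; the names `events` and `binned` are illustrative, and the second event is made up to show two events falling into the same bin:

```r
library(lubridate)
library(dplyr)

# Illustrative data: the 22:28:10 row is invented for this example
events <- data.frame(
  date  = dmy_hms(c("21/05/2010 22:29:50", "21/05/2010 22:28:10",
                    "21/05/2010 09:02:06")),
  value = c(643426, 100, 9875)
)

binned <- events %>%
  # Round each timestamp down to the start of its 3-minute bin
  mutate(interval = floor_date(date, unit = "3 minutes")) %>%
  group_by(interval) %>%
  # Collapse to one row per bin
  summarise(sumvalue = sum(value))

binned
```

Here the two events at 22:28:10 and 22:29:50 both land in the 22:27:00 bin and are summed into a single row.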

If you are comfortable working with Date (I am not), you can dispense with lubridate, and replace the final merge with left_join.
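
One way this variant might look is sketched below. It bins by flooring epoch seconds with base R, which is an assumption about how to avoid lubridate rather than necessarily what the answer had in mind, and it assumes UTC throughout. The names `bin_secs`, `events`, `bins`, `sums` and `result` are illustrative:

```r
library(dplyr)

bin_secs <- 3 * 60  # bin width in seconds

# Two of the events from the question, parsed explicitly as UTC
events <- data.frame(
  date  = as.POSIXct(c("21/05/2010 22:29:50", "21/05/2010 09:02:06"),
                     format = "%d/%m/%Y %H:%M:%S", tz = "UTC"),
  value = c(643426, 9875)
)

# All 480 three-minute bins of the day
bins <- data.frame(
  interval = seq(as.POSIXct("2010-05-21 00:00:00", tz = "UTC"),
                 by = "3 min", length.out = 1440 / 3)
)

sums <- events %>%
  # Floor each timestamp to a multiple of bin_secs since the epoch
  mutate(interval = as.POSIXct(floor(as.numeric(date) / bin_secs) * bin_secs,
                               origin = "1970-01-01", tz = "UTC")) %>%
  group_by(interval) %>%
  summarise(sumvalue = sum(value))

# Keep every bin; bins without events get NA
result <- left_join(bins, sums, by = "interval")
result[450, ]
```

Because left_join keeps every row of `bins`, the result always has 480 rows, and empty bins can be turned into zeros afterwards if needed.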
