Using dplyr::mutate between two dataframes to create column based on date range


Problem description

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).

set.seed(1)    
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])

In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).

For example, in this case the first entry of this new column would be

sum(daily$value[1:5])

and the second would be

sum(daily$value[2:8])

and so on.

I tried using the following code

intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))

This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.

I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.

Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.

Solution

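library(dplyr)
eps <- .Machine$double.eps
# Caveat: eps is far smaller than the spacing between doubles near a Date's
# numeric value, so `end - eps` equals `end`, and between() is inclusive on
# both sides. The sums below therefore include the end date, e.g. row 1 is
# sum(daily$value[1:6]); use `end - 1` to exclude it.
intervals %>% 
  rowwise() %>% 
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps)]))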
#       start        end     nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
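
Since the question asks for an end-exclusive sum, a minimal sketch of the same rowwise() approach with the comparison written out explicitly may be useful. It rebuilds the example data with base seq() rather than lubridate (an assumption made only to keep the block self-contained), and `result` is a name introduced here for illustration.

```r
library(dplyr)

# Rebuild the example data from the question (base R dates instead of lubridate)
set.seed(1)
dates <- seq(as.Date("2008-03-01"), as.Date("2008-03-31"), by = "day")
daily <- data.frame(date = dates, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# Within rowwise(), `start` and `end` are single dates, so the logical
# index selects the daily rows with start <= date < end (end excluded).
result <- intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[daily$date >= start & daily$date < end])) %>%
  ungroup()
```

With this version, result$nhdd[1] should equal sum(daily$value[1:5]) and result$nhdd[2] should equal sum(daily$value[2:8]), as described in the question.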

If you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed. Note that foverlaps() treats both ends of an interval as inclusive.

library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
#        start        end       V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
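
For very large interval tables, one further option is a prefix-sum lookup in base R. This is a different technique from the two shown above, sketched here under stated assumptions: `daily` is sorted by date with one row per date, and every start and end value appears in daily$date. The names `cs`, `i_start` and `i_end` are introduced only for illustration.

```r
# Rebuild the example data from the question (base R dates instead of lubridate)
set.seed(1)
dates <- seq(as.Date("2008-03-01"), as.Date("2008-03-31"), by = "day")
daily <- data.frame(date = dates, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# cs[k + 1] holds the sum of the first k daily values, so cs[1] is 0
cs <- c(0, cumsum(daily$value))

# Row positions of each interval's start and end date within daily
i_start <- match(intervals$start, daily$date)
i_end <- match(intervals$end, daily$date)

# Sum over rows i_start .. (i_end - 1), i.e. start <= date < end
intervals$nhdd <- cs[i_end] - cs[i_start]
```

Because everything is vectorized, no per-row filtering is needed; the cost is one cumsum() over `daily` plus two match() lookups over `intervals`.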
