Using dplyr::mutate between two dataframes to create column based on date range
Problem description
Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality, my daily dataframe has every day for 9 years, and my intervals dataframe has entries that span arbitrary dates within this period. What I want to do is add a column called nhdd to my intervals dataframe that sums the values in daily falling within each interval (end exclusive).
For example, in this case the first entry of this new column would be sum(daily$value[1:5]), the second would be sum(daily$value[2:8]), and so on.
I tried using the following code:
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working, and I think it might have something to do with not referencing the columns correctly, but I'm not sure where to go from here.
I'd really like to use dplyr to solve this rather than a loop, because 11 million rows will take long enough even with dplyr. I tried using more of lubridate, but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
Solution
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
rowwise() %>%
mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
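One thing worth checking in the rowwise approach: between() includes both endpoints, and subtracting .Machine$double.eps from a Date (stored as a number of days, around 13944 here) appears to be too small a change to alter the value, so the sums printed above seem to include the end date. The following is a minimal sketch of an explicitly end-exclusive variant, assuming dates built with as.Date as in the question's edit; the names result and eps_is_noop are illustrative only.

```r
library(dplyr)

# Rebuild the example data (as.Date is used here instead of lubridate::ymd)
set.seed(1)
daily <- data.frame(date  = as.Date("2008-03-01") + 0:30,
                    value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end   = daily$date[c(6, 9, 15, 24, 31)])

# The machine epsilon is far below the spacing of doubles near 13944,
# so subtracting it leaves the end date unchanged
end_num <- as.numeric(intervals$end[1])
eps_is_noop <- (end_num - .Machine$double.eps) == end_num

# Explicit half-open comparison: start <= date < end
result <- intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[daily$date >= start & daily$date < end])) %>%
  ungroup()
```

With this comparison the first entry should equal sum(daily$value[1:5]) and the second sum(daily$value[2:8]), as described in the question.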
If you find the dplyr solution a bit slow (mainly because of rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
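Note that foverlaps treats both interval endpoints as inclusive, so the sums above include the end date. If the end needs to be exclusive, one possible alternative is a non-equi join, which data.table supports in the on argument. This is only a sketch under the assumption of a reasonably recent data.table version; daily_dt, intervals_dt and res are illustrative names, not part of the original answer.

```r
library(data.table)

# Rebuild the example data as data.tables
set.seed(1)
daily_dt <- data.table(date  = as.Date("2008-03-01") + 0:30,
                       value = runif(31, min = 0, max = 45))
intervals_dt <- data.table(start = daily_dt$date[1:5],
                           end   = daily_dt$date[c(6, 9, 15, 24, 31)])

# For each row of intervals_dt, sum the daily values with start <= date < end.
# by = .EACHI evaluates j once per row of intervals_dt, in row order.
res <- daily_dt[intervals_dt,
                on = .(date >= start, date < end),
                .(nhdd = sum(value)),
                by = .EACHI]

# Attach the sums back to the intervals table
intervals_dt[, nhdd := res$nhdd]
```

An interval that matches no daily rows would yield NA here rather than 0, which may or may not be what is wanted for the real data.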