Outputting various subsets from one data frame based on dates
Question
I want to create numerous subsets of data based on date sequences defined in a separate data frame. For example, one data frame has dates and daily recorded values across multiple years; I have created a hypothetical one below. I want to take various subsets of this data frame based on start and end dates defined elsewhere.
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:3000, 300*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(seq(as.Date("2004/1/1"), by = "day", length.out = 3000))
Example <- cbind(df1,df2)
The start and end dates correspond to the 1-year period before a particular sample. So if I sampled on 18/05/2006, I would want all values between 17/05/2005 and 17/05/2006. I have created an example series of dates below using the lubridate package.
library(lubridate)  # provides dmy(), days() and years()
Sample_dates <- as.data.frame(dmy(c("18/05/2006", "07/05/2010", "01/04/2011",
                                    "26/10/2006", "24/09/2010", "27/09/2011")))
End_dates <- (Sample_dates)-days(1)
Start_dates <- (End_dates)-years(1)
Sequence_dates <- cbind(Start_dates,End_dates)
colnames(Sequence_dates) <- c("Startdates", "Enddates")
Subsequently, I should have 6 subsetted outputs from the original data frame (Example) based on the date sequences defined in the second data frame (Sequence_dates). In reality, several more sample dates exist, so a function that picks up these start and end dates in one piece of code would be preferable to manually selecting each start and end date. A loop seemed a strong possibility, so I tried the following, based on a similar (more complex) post found elsewhere: "For() loop to ID dates that are between others and calculate a mean value".
for (i in 1:nrow(Sequence_dates)){
Selected_dates[i] = is.between(Sequence_dates$Startdates[i], Discharge_dates$Enddates[i])
}
However, R does not recognise is.between, and I appreciate the code may be sloppy, as I have never written a loop before. Any help on this would be much appreciated!
James
Recommended answer
I would do it as follows.
Only the end dates seem to be necessary, as each start date is just 1 year earlier.
The loop is done with lapply(), which iterates over all end dates.
Subsetting is done mainly with difftime(), keeping the rows whose date is no earlier than the start date (time difference >= 0) and no later than the end date (time difference <= 0).
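As a small illustration of that filter, here is a minimal sketch on four hypothetical dates (the vector `dates` and the names `start`, `end` and `in_window` are made up for this example and are not part of the original answer). It only shows how the sign of a difftime() result can be turned into a logical mask.

```r
# Four hypothetical dates around a one-year window (illustration only)
dates <- as.Date(c("2005-05-16", "2005-05-17", "2006-05-17", "2006-05-18"))
start <- as.Date("2005-05-17")
end   <- as.Date("2006-05-17")

# difftime(a, b) is a - b, so it is >= 0 when a falls on or after b
after_start <- difftime(dates, start, units = "days") >= 0
before_end  <- difftime(dates, end, units = "days") <= 0

# A date is inside the window when both conditions hold
in_window <- after_start & before_end
dates[in_window]
```

Both boundary dates are kept because the comparisons are inclusive.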
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:3000, 300*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(seq(as.Date("2004/1/1"), by = "day", length.out = 3000))
df <- data.frame(df1, df2)
names(df) <- c("val", "date")
library(lubridate)
ends <- c(dmy(c("18/05/2006","07/05/2010","01/04/2011","26/10/2006","24/09/2010","27/09/2011"))) - days(1)
subs <- lapply(ends, function(x) {
df[difftime(df$date, x - years(1)) >= 0 & difftime(df$date, x) <= 0, ]
})
length(subs)
# [1] 6
min(subs[[1]]$date)
# [1] "2005-05-17"
max(subs[[1]]$date)
# [1] "2006-05-17"
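The same windows can probably also be built without difftime(), since Date objects can be compared directly with >= and <=. The following is a sketch under that assumption, not the answerer's method: it uses base R only, the helper `one_year_before()` is a hypothetical name introduced here, and naming the list by sample date is just one possible convention. It also computes a mean per window, since the question mentions that as the eventual goal.

```r
# Rebuild the same kind of hypothetical data with base R only
set.seed(24)
df <- data.frame(
  val  = sample(0:3000, 3000, replace = TRUE),
  date = seq(as.Date("2004-01-01"), by = "day", length.out = 3000)
)

sample_dates <- as.Date(c("18/05/2006", "07/05/2010", "01/04/2011",
                          "26/10/2006", "24/09/2010", "27/09/2011"),
                        format = "%d/%m/%Y")
ends <- sample_dates - 1  # the day before each sample

# Hypothetical helper: step back one calendar year using seq()
one_year_before <- function(d) seq(d, by = "-1 year", length.out = 2)[2]

# One subset per end date, using direct Date comparisons (inclusive bounds)
subs <- lapply(seq_along(ends), function(i) {
  start <- one_year_before(ends[i])
  df[df$date >= start & df$date <= ends[i], ]
})
names(subs) <- format(sample_dates)  # label each subset by its sample date

# Mean of the recorded values within each window
window_means <- sapply(subs, function(s) mean(s$val))
window_rows  <- sapply(subs, nrow)
```

With inclusive bounds and no leap day inside any of these six windows, each subset holds 366 rows. If an end date can fall on 29 February, how the one-year step is handled differs between approaches, so that case is worth checking separately.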