Categorize based on date ranges in R
Question
How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?
My large dataframe, captures, looks similar to this when called via head(captures):
id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1
My small dataframe, seasons, looks similar to this in its entirety:
Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15
I need to add a 'season' column to my captures dataframe, where the value is determined by whether and where captures$date falls within the ranges defined in seasons.
Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.
# load packages
library(dplyr)
library(lubridate)
# make blank column
captures$season = NA
# nested loop: every season is checked against every row, one element at a time
for (i in 1:length(seasons$Season)) {
  for (j in 1:length(captures$id)) {
    captures$season[j] = ifelse(between(captures$date[j], ymd(seasons$Opening.Date[i]), ymd(seasons$Closing.Date[i])),
                                seasons$Season[i], captures$season[j])
  }
}
Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!
Recommended answer
It would indeed be great if you could perform a join efficiently based on a range of values instead of equality. Unfortunately, I don't know whether a general solution exists. For the time being, I suggest using a single for loop.
Vectorization pays off most along the tallest data. That is, if we loop over one data.frame and vectorize over the other, it makes more sense to vectorize over the longer one and loop over the shorter one. With this in mind, we'll loop over the rows of seasons and vectorize over the 2M rows of data.
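To make the idea concrete, here is a minimal sketch of what one pass of such a loop does: a single season is tested against every date in one vectorized comparison. The toy dates and the variable names (dates, opening, closing, in_season) are made up for illustration, and the closing date is treated as inclusive, which is an assumption.

```r
# Hypothetical toy data, not the asker's real frame
dates   <- as.Date(c("2015-09-25", "2015-09-26", "2015-12-01", "2016-02-01"))
opening <- as.Date("2015-09-26")
closing <- as.Date("2016-01-10")

# One vectorized comparison tests every date against this season at once
# (closing date treated as inclusive here; that is an assumption)
in_season <- dates >= opening & dates <= closing

# Fill in the season only where the test passed; other rows stay NA
season <- rep(NA_integer_, length(dates))
season[in_season] <- 2015L
season
```

With 12 seasons, this comparison runs 12 times regardless of how many rows the large frame has.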
Your data:
txt <- "Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)
To start the process, we assume that the season for every row is not yet defined:
dat$season <- NA
Loop over each row of seasons:
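# Loop over the 12 seasons; each pass is vectorized over every row of dat
for (i in seq_len(nrow(seasons))) {
  # Note: `<` excludes the closing date itself; use `<=` to include it
  dat$season <- ifelse(is.na(dat$season) &
                         dat$date >= seasons$Opening.Date[i] &
                         dat$date < seasons$Closing.Date[i],
                       seasons$Season[i], dat$season)
}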
dat
# id date sex season
# 1 160520 2016-11-22 1 2016
# 2 1029735 2016-11-12 1 2016
# 3 1885200 2016-11-05 1 2016
# 4 2058366 2015-09-26 2 2015
# 5 2058367 2015-09-26 1 2015
# 6 2058368 2015-09-26 1 2015
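If the seasons never overlap, as appears to be the case here, the loop can probably be avoided altogether with base R's findInterval(). The following is only a sketch of that alternative, not part of the original answer: assign_season() is a hypothetical helper name, the non-overlap condition is an assumption, and the closing date is treated as inclusive (matching between() in the question), unlike the loop above, which uses a strict < on the closing date.

```r
# Assumes seasons never overlap; assign_season() is a hypothetical helper name
assign_season <- function(dates, seasons) {
  # findInterval() needs its breakpoints sorted in increasing order
  s <- seasons[order(seasons$Opening.Date), ]
  # Index of the latest season that opened on or before each date (0 = none)
  idx <- findInterval(as.numeric(dates), as.numeric(s$Opening.Date))
  idx[idx == 0] <- NA
  out <- s$Season[idx]
  # Dates after that season's closing date fall in the off-season
  out[!is.na(idx) & dates > s$Closing.Date[idx]] <- NA
  out
}

# Small made-up example using two of the seasons from the question
seasons <- data.frame(
  Season       = c(2016L, 2015L),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10"))
)
dates <- as.Date(c("2016-11-22", "2015-09-26", "2016-01-10",
                   "2016-06-01", "2014-01-01"))
result <- assign_season(dates, seasons)
result
```

On the real data this would be called as something like dat$season <- assign_season(dat$date, seasons). Rows that fall between seasons, or before the first opening date, come back as NA.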