Categorize based on date ranges in R

Question

How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?

My large dataframe, captures, looks similar to this when called via head(captures):

       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1

My small dataframe, seasons, looks similar to this in its entirety:

Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15 

I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.

Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.

#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
  for (j in 1:length(captures$id)) {
    captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
  }
}

Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!

Recommended answer

It would indeed be great if you could efficiently do a join operation based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. For the time being, I suggest using a single for loop.
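For what it's worth, one option that may fit here is a non-equi join from the data.table package. The following is only a minimal sketch, assuming data.table is installed and that the seasons do not overlap; the toy rows below are made up for illustration and stand in for the real captures and seasons frames.

```r
library(data.table)

# Toy stand-ins for the real data (illustrative values only)
seasons <- data.table(
  Season       = c(2016L, 2015L),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10"))
)
captures <- data.table(
  id   = c(160520L, 2058366L, 999L),
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-06-01"))
)

# Update join: for each season, rows of captures whose date lies in
# [Opening.Date, Closing.Date] get that season's label, by reference.
# Rows that match no season keep NA.
captures[seasons,
         on = .(date >= Opening.Date, date <= Closing.Date),
         season := i.Season]

captures$season
```

The third row falls between two seasons, so its season stays NA. Both range ends are treated as inclusive here; switch to < or > in the on clause if the closing date should be excluded.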

Vectorization pays off most along the tallest data. That is, if we loop over one data.frame and vectorize over the other, it makes more sense to vectorize over the longer one and loop over the shorter one. With this in mind, we'll loop over the rows of seasons and vectorize over the 2M rows of data.

Your data:

txt <- "Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)

To start the process, we assume that no row's season has been defined yet:

dat$season <- NA

Loop over each row of seasons:

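for (i in seq_len(nrow(seasons))) {
  # Vectorized over all rows of dat: fill only rows that are still NA and whose
  # date lies within season i (closing date included, as with between() above).
  dat$season <- ifelse(is.na(dat$season) &
                         dat$date >= seasons$Opening.Date[i] &
                         dat$date <= seasons$Closing.Date[i],
                       seasons$Season[i], dat$season)
}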
dat
#        id       date sex season
# 1  160520 2016-11-22   1   2016
# 2 1029735 2016-11-12   1   2016
# 3 1885200 2016-11-05   1   2016
# 4 2058366 2015-09-26   2   2015
# 5 2058367 2015-09-26   1   2015
# 6 2058368 2015-09-26   1   2015
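
If even the 12-pass loop turns out to be too slow, a loop-free variant in base R might be worth trying. This is a sketch rather than a tested drop-in: it assumes the seasons never overlap, treats both the opening and closing dates as inclusive, and uses a few made-up rows in place of the real data. The names s, idx and in_season are just helpers introduced for the example.

```r
# Toy stand-ins for the real data (illustrative values only)
seasons <- data.frame(
  Season       = c(2016, 2015, 2014),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26", "2014-09-27")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10", "2015-01-11"))
)
captures <- data.frame(
  id   = 1:5,
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-03-01",
                   "2017-01-15", "2010-01-01"))
)

# findInterval() needs the break points in increasing order
s <- seasons[order(seasons$Opening.Date), ]

# For each capture, index of the latest season that opened on or before its date
idx <- findInterval(as.numeric(captures$date), as.numeric(s$Opening.Date))
idx[idx == 0] <- NA  # date is earlier than every opening date

# Keep the candidate season only if the date is not past its closing date
in_season <- !is.na(idx) & captures$date <= s$Closing.Date[idx]
captures$season <- ifelse(in_season, s$Season[idx], NA)

captures$season
# 2016 2015   NA 2016   NA
```

The idea is that findInterval() does a binary search per row, so the whole column is classified in one vectorized call instead of one pass per season.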
