How to perform a join based on intervals with dplyr?


Question

I have a data frame containing two columns: a grouping variable and an interval over which that grouping variable holds. I have another data frame with a date column and a value column. How can I join these two tables reasonably efficiently with dplyr + tidyverse functions?

library(dplyr)
library(lubridate)

# Daily observations for 2017-01-01 through 2017-01-20
ty <- tibble(date = mdy(paste(1, seq(20), 2017, sep = "/")),
             y = c(rnorm(7), rnorm(7, mean = 2), rnorm(6, mean = -1)))
# One interval per batch
gy <- tibble(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")),
                               mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))),
             batch = c(1, 2, 3))

I want to build a table equivalent to:

ty %>% mutate(batch = c(rep(1, 7), rep(2, 7), rep(3, 6)))

Ideally, this should work reasonably quickly on data sets of up to 1,000,000 rows. Better still if it works on 100,000,000 :).

Recommended answer

How about:

ty %>%
  mutate(batch = case_when(
    ty$date %within% gy$period[1] ~ gy$batch[1],
    ty$date %within% gy$period[2] ~ gy$batch[2],
    ty$date %within% gy$period[3] ~ gy$batch[3]))

You would obviously need to write out one case_when condition per interval. How many do you have? I've used cat and paste0 to good effect for generating them in the past.

Edited to reflect OP's comments. This should take care of the NSE and allows the case_when conditions to be generated programmatically:

ty %>%
  mutate(batch = eval(parse(text = paste0("case_when(",
                                      paste(
                                        paste0(
                                          "ty$date %within% gy$period[",
                                          seq_along(gy$period),
                                          "] ~gy$batch[",
                                          seq_along(gy$period),
                                          "]"
                                        ),
                                        collapse = ", "
                                      ), ")"))))
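
As a possible alternative to building code with eval(parse()), here is a minimal sketch of the same lookup written as a non-equi join. It assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available; that is not something the answer above relies on. The names gy_bounds, start and end are made up for this illustration, and y is replaced with fixed numbers so the result is reproducible.

```r
library(dplyr)
library(lubridate)

# Example data mirroring the question (dates 2017-01-01 to 2017-01-20)
ty <- tibble(date = mdy(paste(1, seq(20), 2017, sep = "/")),
             y = seq(20))
gy <- tibble(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")),
                               mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))),
             batch = c(1, 2, 3))

# Split each interval into plain Date bounds (hypothetical helper table)
gy_bounds <- tibble(start = as_date(int_start(gy$period)),
                    end   = as_date(int_end(gy$period)),
                    batch = gy$batch)

# Keep every row of ty and attach the batch whose [start, end] contains date
joined <- ty %>%
  left_join(gy_bounds, by = join_by(between(date, start, end))) %>%
  select(date, y, batch)
```

Dates that fall outside every interval would get batch = NA with left_join, and a date covered by overlapping intervals would produce one row per match, so this sketch assumes the intervals do not overlap.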
