How to most efficiently filter a dataframe conditionally on values in another one, in the tidyverse framework?

Problem description

I have a dataframe df1 with an ID column and a lubridate time interval column, and I want to filter (subsample) a dataframe df2, which has ID and DateTime columns, so that only the df2 rows whose DateTime falls within the interval for the corresponding ID in df1 are kept. I want to do this in a tidyverse framework.
It can easily be done using a join (see the example below), but I would like to know whether there is a more direct solution (maybe purrr-based) that avoids joining and then removing the time-interval data from the second dataframe. Thanks.

The question "Merge two dataframes if timestamp of x is within time interval of y" is close to the one asked here, but the solutions proposed there were similar to the one I developed and did not use the tidyverse framework.

Minimal code showing the problem and my current solution:

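library(tibble)
library(dplyr)      # provides mutate() and %>%, which the code below relies on
library(lubridate)

# One row per ID, with the start and end dates of its interval
df1 <- tribble(
  ~ID, ~Date1, ~Date2,
  "ID1", "2018-04-16", "2018-06-14",
  "ID2", "2018-04-20", "2018-06-25")
df1 <- mutate(df1, Interval = interval(ymd(Date1), ymd(Date2)))

# Observations to be filtered by the intervals above
df2 <- tribble(
  ~ID, ~DateTime,
  "ID1", "2018-04-12",
  "ID1", "2018-05-05",
  "ID2", "2018-04-23",
  "ID2", "2018-07-12")
df2 <- mutate(df2, DateTime = ymd(DateTime))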

df1 looks like this:

> df1
# A tibble: 2 x 4
  ID    Date1      Date2      Interval                      
  <chr> <chr>      <chr>      <S4: Interval>                
1 ID1   2018-04-16 2018-06-14 2018-04-16 UTC--2018-06-14 UTC
2 ID2   2018-04-20 2018-06-25 2018-04-20 UTC--2018-06-25 UTC

and df2 like this:

> df2
# A tibble: 4 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-04-12
2 ID1   2018-05-05
3 ID2   2018-04-23
4 ID2   2018-07-12

In df2, the first record for ID1 (2018-04-12) is not within the ID1 interval in df1, because it falls before the interval starts on 2018-04-16. The second record for ID2 (2018-07-12) is also not within the ID2 interval in df1, because it falls after the interval ends on 2018-06-25.
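The membership of each df2 row can be checked directly with a short sketch. It assumes that df1 holds exactly one interval per ID and that every ID in df2 appears in df1; the names row_interval and in_range are introduced here for illustration only.

```r
library(tibble)
library(dplyr)
library(lubridate)

df1 <- tribble(
  ~ID,   ~Date1,       ~Date2,
  "ID1", "2018-04-16", "2018-06-14",
  "ID2", "2018-04-20", "2018-06-25") %>%
  mutate(Interval = interval(ymd(Date1), ymd(Date2)))

df2 <- tribble(
  ~ID,   ~DateTime,
  "ID1", "2018-04-12",
  "ID1", "2018-05-05",
  "ID2", "2018-04-23",
  "ID2", "2018-07-12") %>%
  mutate(DateTime = ymd(DateTime))

# Look up the interval belonging to each df2 row by ID
# (assumption: one interval per ID in df1, and no df2 ID missing from df1)
row_interval <- df1$Interval[match(df2$ID, df1$ID)]

# Element-wise test: is each DateTime inside its own ID's interval?
in_range <- df2$DateTime %within% row_interval
in_range
# [1] FALSE  TRUE  TRUE FALSE
```

Only rows 2 and 3 pass the check, so these are the rows a correct filter should keep.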

My current solution, based on joining and then removing the joined column, is as follows:

df_out <- df2 %>%
  left_join(df1, by = "ID") %>%
  filter(DateTime %within% Interval) %>%
  select(-Interval)

> df_out
# A tibble: 2 x 4
  ID    DateTime   Date1      Date2     
  <chr> <date>     <chr>      <chr>     
1 ID1   2018-05-05 2018-04-16 2018-06-14
2 ID2   2018-04-23 2018-04-20 2018-06-25

I have the feeling that a tidyverse alternative should exist that avoids joining and then removing the Interval column.

Recommended answer

There is a package called fuzzyjoin that can do a semi join based on an interval. A semi join filters the "left" dataframe according to whether its rows have a match in the "right" dataframe, without adding any columns from the right one. Try:

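library(fuzzyjoin)
# Match on ID as well, so that each row is compared only with its own ID's interval;
# matching on the dates alone would accept a DateTime that falls in another ID's interval.
df2 %>%
  fuzzy_semi_join(df1,
                  by = c("ID" = "ID", "DateTime" = "Date1", "DateTime" = "Date2"),
                  match_fun = list(`==`, `>=`, `<=`))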

The result:

# A tibble: 2 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-05-05
2 ID2   2018-04-23
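
As a possible alternative that stays within dplyr, the sketch below uses semi_join() with a join_by() specification. This is not part of the answer above: it assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available, and it converts Date1 and Date2 to Date so that the bounds have the same type as DateTime.

```r
library(tibble)
library(dplyr)      # assumption: dplyr >= 1.1.0 for join_by()
library(lubridate)

df1 <- tribble(
  ~ID,   ~Date1,       ~Date2,
  "ID1", "2018-04-16", "2018-06-14",
  "ID2", "2018-04-20", "2018-06-25") %>%
  mutate(Date1 = ymd(Date1), Date2 = ymd(Date2))  # bounds as Date, not character

df2 <- tribble(
  ~ID,   ~DateTime,
  "ID1", "2018-04-12",
  "ID1", "2018-05-05",
  "ID2", "2018-04-23",
  "ID2", "2018-07-12") %>%
  mutate(DateTime = ymd(DateTime))

# Keep the df2 rows that have a df1 row with the same ID and with
# Date1 <= DateTime <= Date2 (between() bounds are inclusive by default).
# A semi join only filters df2, so no df1 columns need to be dropped afterwards.
df_out <- semi_join(df2, df1, by = join_by(ID, between(DateTime, Date1, Date2)))
df_out
```

With the example data this should keep the same two rows as the fuzzyjoin call, and no Interval column is needed at all.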
