Efficient way to filter one data frame by ranges in another


Question

Let's say I have a data frame containing a bunch of data and a date/time column indicating when each data point was collected. I have another data frame that lists time spans, where a "Start" column indicates the date/time when each span starts and an "End" column indicates the date/time when each span ends.

I've created a dummy example below using simplified data:

main_data = data.frame(Day=c(1:30))

spans_to_filter = 
    data.frame(Span_number = c(1:6),
               Start = c(2,7,1,15,12,23),
               End = c(5,10,4,18,15,26))

I toyed around with a few ways of solving this problem and ended up with the following solution:

library(dplyr)  # library() stops with an error if dplyr is not installed
filtered.main_data =
    main_data %>% 
    rowwise() %>% 
    mutate(present = any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)) %>% 
    filter(present) %>% 
    data.frame()

This works perfectly fine, but I noticed it can take a while to process if I have a lot of data (I assume because I'm performing a row-wise comparison). I'm still learning the ins and outs of R, and I was wondering whether there is a more efficient way of performing this operation, preferably using dplyr/tidyr.

Solution

Here's a function that you can run in dplyr to find dates within a given range using the between function (from dplyr). For each value of Day, mapply runs between on each of the pairs of Start and End dates and the function uses rowSums to return TRUE if Day is between at least one of them. I'm not sure if it's the most efficient approach, but it results in nearly a factor of four improvement in speed.

test.overlap = function(vals) {
  rowSums(mapply(function(a,b) between(vals, a, b), 
                 spans_to_filter$Start, spans_to_filter$End)) > 0
}

main_data %>% 
  filter(test.overlap(Day))
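To make the intermediate step more concrete, here is a minimal base-R sketch of the same idea: build a logical matrix with one row per Day and one column per span, then keep the rows that have at least one TRUE. It uses outer() instead of mapply() and between(), so it is an illustration of the logic rather than the code that was benchmarked, and the names day, start, end, in_span and keep are chosen here for the example only.

```r
# Small example data mirroring the question
day   <- 1:30
start <- c(2, 7, 1, 15, 12, 23)
end   <- c(5, 10, 4, 18, 15, 26)

# One row per day, one column per span:
# TRUE where start[j] <= day[i] <= end[j]
in_span <- outer(day, start, ">=") & outer(day, end, "<=")

# A day is kept if it falls inside at least one span
keep <- rowSums(in_span) > 0

day[keep]
```

Note that this matrix has length(day) * length(start) cells, so with very many spans the memory cost may become the limiting factor.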

If you're working with dates (rather than with date-times), it may be even more efficient to create a vector of specific dates and test for membership (this might be a better approach even with date-times):

# unlist() also handles spans of unequal length, where apply() returns a list
# rather than a matrix
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))

main_data %>% 
  filter(Day %in% filt.vals)
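Enumerating every value only works when the values are discrete, such as whole days. For date-times at arbitrary resolution, one possible alternative (a different technique from the two above, sketched here under the assumption that every span satisfies Start <= End) is to sort the spans by start and use findInterval() together with a running maximum of the ends. The helper name in_any_span and the example timestamps are hypothetical.

```r
# Hypothetical helper: TRUE for each x that lies in at least one [start, end] span.
# All arguments are plain numerics; convert date-times with as.numeric() first.
in_any_span <- function(x, start, end) {
  ord <- order(start)
  start_sorted <- start[ord]
  # Largest end among all spans that start at or before each sorted start
  max_end <- cummax(end[ord])
  # For each x, the number of spans whose start is <= x (0 if none)
  idx <- findInterval(x, start_sorted)
  keep <- logical(length(x))
  hit <- idx > 0
  # x is covered if it does not exceed the largest end of those spans
  keep[hit] <- x[hit] <= max_end[idx[hit]]
  keep
}

# Example with hourly date-times (made-up values for illustration)
times <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + 3600 * (0:47)
span_start <- as.POSIXct(c("2024-01-01 06:30:00", "2024-01-01 20:00:00"), tz = "UTC")
span_end   <- as.POSIXct(c("2024-01-01 09:15:00", "2024-01-02 02:00:00"), tz = "UTC")

keep <- in_any_span(as.numeric(times), as.numeric(span_start), as.numeric(span_end))
times[keep]
```

Inside dplyr this could be used as filter(in_any_span(as.numeric(Day), Start_values, End_values)), where Start_values and End_values stand for the numeric span columns.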

Now compare execution speeds. I shortened your code to require only the filtering operation:

library(microbenchmark)

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = main_data %>% 
    filter(Day %in% filt.vals)
  )

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       OP 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b 
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a   
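Before reading too much into the timings, it may be worth confirming that the three approaches select the same rows. The following is a base-R sketch of that check on the small example; it mirrors the logic of each method without dplyr, so the variable names (by_row, by_span, by_member) are illustrative and not part of the benchmarked code.

```r
day   <- 1:30
start <- c(2, 7, 1, 15, 12, 23)
end   <- c(5, 10, 4, 18, 15, 26)

# Row-wise comparison, as in the question
by_row <- vapply(day, function(d) any(d >= start & d <= end), logical(1))

# One vectorised range test per span, combined across spans (as in test.overlap)
by_span <- rowSums(mapply(function(a, b) day >= a & day <= b, start, end)) > 0

# Membership in the enumerated span values
by_member <- day %in% unlist(Map(seq, start, end))

# All three should flag exactly the same days
stopifnot(identical(by_row, by_span), identical(by_row, by_member))
```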

UPDATE: Below is a test with a much larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(Day=c(1:100000))

spans_to_filter = 
  data.frame(Span_number = c(1:9),
             Start = c(2,7,1,15,12,23,90,9000,50000),
             End = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  OP=main_data %>% 
    rowwise() %>% 
    filter(any(Day >= spans_to_filter$Start & Day <= spans_to_filter$End)),
  eipi10 = main_data %>% 
    filter(test.overlap(Day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["Start"]:a["End"]))
    main_data %>% 
      filter(Day %in% filt.vals)}, 
  times=10
  )

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       OP 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a 
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a 
