Filter window with dplyr: find matching row, and keep subsequent N rows


Problem description

I have a data frame and I would like to keep the rows that match some condition, together with the N rows that follow each match. For example, consider a data frame with an hour column and a minutes column (together representing a timestamp per row). Let's say I would like the first two records starting at the 0th and the 6th hour. Is there a nice way to do this?

library(dplyr)

set.seed(3)
df <- 
    data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3)) %>%
    arrange(hour, minutes)

which produces

> df
   hour   minutes count
1     0  9.914450     3
2     1 47.643468     3
3     2 22.711599     5
4     3 19.336325     5
5     4 35.523940     1
6     5 35.659249     4
7     6  7.353373     5
8     7 17.381455     2
9     8 34.078985     2
10    9 37.227777     0
11   10 30.208938     1
12   11 29.796411     1

The normal filter returns two rows:

> df %>%
+     filter(hour%%6 == 0)
  hour  minutes count
1    0 9.914450     3
2    6 7.353373     5

However, the answer should be:

  hour   minutes count
1    0  9.914450     3
2    1 47.643468     3
3    6  7.353373     5
4    7 17.381455     2

In this case it is possible to use modulo arithmetic on the column used for filtering, but in the general case this may not be possible.

The original example is provided below; there I wanted the first two records within each even hour. In that case, Akrun's answer is good and exploits the group structure in the data. E.g.

library(dplyr)
set.seed(0)
df <- 
    data.frame(hour = rep(0:11, 3), minutes = runif(36, 0, 59), count = rpois(36, 3)) %>%
    arrange(hour, minutes)

looks like:

   hour    minutes count
1     0  7.4077507     2
2     0 10.4168484     3
3     0 52.9051348     4
4     1 15.6650111     4
5     1 15.7660195     5
6     1 40.5343480     4
7     2 21.9553101     1
8     2 22.6621194     4
9     2 22.7807315     2
10    3  0.7900297     3
11    3 33.7983484     4
12    3 45.4206438     3
...

One could do

df %>% mutate(is_even_hour = ifelse(hour %% 2 == 0, 1, 0)) %>%
    filter(is_even_hour == 1) %>%
    group_by(hour, is_even_hour) %>%
    filter(row_number() <= 2) %>%
    ungroup %>%
    select(-is_even_hour)

which gives

    hour   minutes count
   <int>     <dbl> <int>
1      0  7.407751     2
2      0 10.416848     3
3      2 21.955310     1
4      2 22.662119     4
5      4 22.560889     2
6      4 29.364255     5
7      6 20.080591     2
8      6 53.004991     3
9      8 35.374384     4
10     8 38.987070     3
11    10  3.645390     4
12    10 10.986838     5

Solution

After grouping by 'hour', we can do this in a single filter step:

df %>%
     group_by(hour) %>%
     # keep even hours only, and at most the first two rows of each group
     filter(hour %% 2 == 0 & row_number() < 3)
#     hour   minutes count
#    <int>     <dbl> <int>
#1      0  7.407751     2
#2      0 10.416848     3
#3      2 21.955310     1
#4      2 22.662119     4
#5      4 22.560889     2
#6      4 29.364255     5
#7      6 20.080591     2
#8      6 53.004991     3
#9      8 35.374384     4
#10     8 38.987070     3
#11    10  3.645390     4
#12    10 10.986838     5
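
As an alternative sketch (not part of the original answer), the same rows can be obtained by filtering the even hours first and then taking the first two rows per group with `slice_head()`. This assumes dplyr 1.0.0 or later; the result name `first_two` is only illustrative.

```r
library(dplyr)

# Same data as in the question's original example
set.seed(0)
df <- data.frame(hour = rep(0:11, 3),
                 minutes = runif(36, 0, 59),
                 count = rpois(36, 3)) %>%
    arrange(hour, minutes)

# Keep even hours, then take the first two rows within each hour
first_two <- df %>%
    filter(hour %% 2 == 0) %>%
    group_by(hour) %>%
    slice_head(n = 2) %>%
    ungroup()
```

Because the data is already sorted by hour and minutes, the first two rows of each group are the two earliest records of that hour.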


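For the updated post:

# %>% binds tighter than +, so the addition is wrapped in braces
# to make the intended order of operations explicit
i1 <- df %>% 
          filter(hour %% 6 == 0) %>%
          .$hour %>% 
          {rep(., each = 2) + 0:1} %>%   # hours 0, 1, 6, 7
          match(df$hour)                 # their row positions in df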
df[i1,]
#   hour   minutes count
#1    0  9.914450     3
#2    1 47.643468     3
#7    6  7.353373     5
#8    7 17.381455     2
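
A dplyr-only variant for N = 2, offered as a sketch rather than as part of the original answer: flag the matching rows, then keep a row if it, or the row directly before it, is flagged. The helper column `hit` and the result name `window2` are introduced here for illustration. The approach relies only on row order, so it does not assume that the hour values are consecutive.

```r
library(dplyr)

# Same data as in the updated question
set.seed(3)
df <- data.frame(hour = 0:11,
                 minutes = runif(12, 0, 59),
                 count = rpois(12, 3)) %>%
    arrange(hour, minutes)

window2 <- df %>%
    mutate(hit = hour %% 6 == 0) %>%                     # TRUE on matching rows
    filter(hit | dplyr::lag(hit, default = FALSE)) %>%   # match, or row right after a match
    select(-hit)

window2$hour
# 0 1 6 7
```

For a larger N, further `dplyr::lag(hit, 2, default = FALSE)`, `dplyr::lag(hit, 3, default = FALSE)`, and so on could be OR-ed into the filter condition.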


Or this can be done compactly with data.table:

library(data.table)
# setDT() converts df by reference; the inner call yields row positions 1, 2, 7, 8
setDT(df)[df[, rep(which(!hour %% 6), each = 2) + 0:1]]
#   hour   minutes count
#1:    0  9.914450     3
#2:    1 47.643468     3
#3:    6  7.353373     5
#4:    7 17.381455     2
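
For the general case raised in the question, where the condition is arbitrary and N may vary, the index arithmetic above can be wrapped in a small function. The following is a minimal sketch under stated assumptions: `keep_window` is a hypothetical helper name (it is not a dplyr or data.table function), the condition is passed as a logical vector with one entry per row, and positions that would run past the last row are dropped.

```r
library(dplyr)

# Hypothetical helper (not from the original post): keep every row where
# `cond` is TRUE plus the rows after it, `n` rows in total per match.
keep_window <- function(data, cond, n = 2) {
    hits <- which(cond)                                   # positions of matching rows
    idx <- as.integer(outer(hits, seq_len(n) - 1L, "+"))  # each hit and its n - 1 successors
    idx <- sort(unique(idx[idx <= nrow(data)]))           # drop out-of-range, de-duplicate
    slice(data, idx)
}

# Same data as in the updated question
set.seed(3)
df <- data.frame(hour = 0:11,
                 minutes = runif(12, 0, 59),
                 count = rpois(12, 3)) %>%
    arrange(hour, minutes)

res <- keep_window(df, df$hour %% 6 == 0, n = 2)
res$hour
# 0 1 6 7
```

Overlapping windows are merged rather than duplicated, since the positions are de-duplicated before slicing.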
