Filter window with dplyr: find matching row, and keep subsequent N rows

Views: 128 Published: 2017/7/13 22:01:06 Tags: r, dplyr

Question

I have a data frame and I would like to keep the rows that match some condition, together with the N rows that follow each match. For example, consider a data frame that contains an hour and a minutes column (representing a timestamp per row). Let's say I would like two records starting at the 0th and at the 6th hour, that is, the matching row and the row after it. Is it possible to do this in a nice way?
    library(dplyr)
    set.seed(3)
    df <- data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3)) %>%
      arrange(hour, minutes)
which produces
    > df
       hour   minutes count
    1     0  9.914450     3
    2     1 47.643468     3
    3     2 22.711599     5
    4     3 19.336325     5
    5     4 35.523940     1
    6     5 35.659249     4
    7     6  7.353373     5
    8     7 17.381455     2
    9     8 34.078985     2
    10    9 37.227777     0
    11   10 30.208938     1
    12   11 29.796411     1
The normal filter returns two rows:
    > df %>%
    +   filter(hour %% 6 == 0)
      hour  minutes count
    1    0 9.914450     3
    2    6 7.353373     5
However, the answer should be:
      hour   minutes count
    1    0  9.914450     3
    2    1 47.643468     3
    3    6  7.353373     5
    4    7 17.381455     2
In this case it is possible to use modulo arithmetic on the column used for filtering, but in the general case this may not be possible.

The original example is provided below; there I wanted the first two records in each even hour. In that case, Akrun's answer is good and exploits the group structure in the data. E.g.
    library(dplyr)
    set.seed(0)
    df <- data.frame(hour = rep(0:11, 3), minutes = runif(36, 0, 59), count = rpois(36, 3)) %>%
      arrange(hour, minutes)
looks like:
       hour    minutes count
    1     0  7.4077507     2
    2     0 10.4168484     3
    3     0 52.9051348     4
    4     1 15.6650111     4
    5     1 15.7660195     5
    6     1 40.5343480     4
    7     2 21.9553101     1
    8     2 22.6621194     4
    9     2 22.7807315     2
    10    3  0.7900297     3
    11    3 33.7983484     4
    12    3 45.4206438     3
    ...
One could do
    df %>%
      mutate(is_even_hour = ifelse(hour %% 2 == 0, 1, 0)) %>%
      filter(is_even_hour == 1) %>%
      group_by(hour, is_even_hour) %>%
      filter(row_number() <= 2) %>%
      ungroup() %>%
      select(-is_even_hour)
which gives
        hour   minutes count
       <int>     <dbl> <int>
    1      0  7.407751     2
    2      0 10.416848     3
    3      2 21.955310     1
    4      2 22.662119     4
    5      4 22.560889     2
    6      4 29.364255     5
    7      6 20.080591     2
    8      6 53.004991     3
    9      8 35.374384     4
    10     8 38.987070     3
    11    10  3.645390     4
    12    10 10.986838     5

Solution
After grouping by 'hour', we can do this in a single filter step
    df %>%
      group_by(hour) %>%
      filter(!hour %% 2 & row_number() < 3)
    #    hour   minutes count
    #   <int>     <dbl> <int>
    #1      0  7.407751     2
    #2      0 10.416848     3
    #3      2 21.955310     1
    #4      2 22.662119     4
    #5      4 22.560889     2
    #6      4 29.364255     5
    #7      6 20.080591     2
    #8      6 53.004991     3
    #9      8 35.374384     4
    #10     8 38.987070     3
    #11    10  3.645390     4
    #12    10 10.986838     5
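
The condition `!hour %% 2` relies on R coercing a remainder of 0 to FALSE, so negating it marks the even hours. As a minimal sketch of the same grouped filter with the test spelled out, the version below rebuilds the example data from the question and uses `n_keep`, a hypothetical parameter that does not appear in the original answer, for the number of rows to keep per group:

```r
library(dplyr)

# Example data from the question (36 rows, three records per hour)
set.seed(0)
df <- data.frame(hour = rep(0:11, 3),
                 minutes = runif(36, 0, 59),
                 count = rpois(36, 3)) %>%
  arrange(hour, minutes)

n_keep <- 2  # hypothetical parameter: rows to keep within each group

first_n_even <- df %>%
  group_by(hour) %>%
  # keep only even hours, and only the first n_keep rows of each hour
  filter(hour %% 2 == 0, row_number() <= n_keep) %>%
  ungroup()
```

Separating the two conditions with a comma is equivalent to joining them with `&` inside `filter()`.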

For the updated post
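    # The braces are needed: %>% binds more tightly than +, so without them
    # 0:1 would be piped into match() on its own.
    # Note that this matches on hour values, so it assumes the hours are
    # unique, consecutive integers.
    i1 <- df %>%
      filter(hour %% 6 == 0) %>%
      .$hour %>%
      {rep(., each = 2) + 0:1} %>%
      match(., df$hour)
    df[i1, ]
    #  hour   minutes count
    #1    0  9.914450     3
    #2    1 47.643468     3
    #7    6  7.353373     5
    #8    7 17.381455     2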
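
When the filtering column cannot be used for arithmetic, one possible alternative (a sketch, not part of the original answer) is to start a new group at every matching row with `cumsum()` and keep the first rows of each group. The names `n_after`, `grp` and `window_rows` are illustrative assumptions:

```r
library(dplyr)

# Example data from the updated question (one record per hour)
set.seed(3)
df <- data.frame(hour = 0:11,
                 minutes = runif(12, 0, 59),
                 count = rpois(12, 3)) %>%
  arrange(hour, minutes)

n_after <- 1  # hypothetical parameter: rows to keep after each match

window_rows <- df %>%
  # grp increases by one at every matching row; rows before the first match get 0
  mutate(grp = cumsum(hour %% 6 == 0)) %>%
  filter(grp > 0) %>%
  group_by(grp) %>%
  # the matching row is first in its group, followed by up to n_after rows
  filter(row_number() <= n_after + 1) %>%
  ungroup() %>%
  select(-grp)
```

Because the window is defined by row position within each group, it never indexes past the end of the data, and a match that falls inside an earlier window simply starts a new one.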

Or this can be done in a compact way with data.table
    library(data.table)
    setDT(df)[df[, rep(which(!hour %% 6), each = 2) + 0:1]]
    #   hour   minutes count
    #1:    0  9.914450     3
    #2:    1 47.643468     3
    #3:    6  7.353373     5
    #4:    7 17.381455     2
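
The same row-index idea can be generalised to an arbitrary number of following rows. The sketch below assumes a hypothetical parameter `n_after` and also drops indices that would run past the last row or repeat when windows overlap, which the compact one-liner does not guard against:

```r
library(data.table)

# Example data from the updated question, built directly as a data.table
set.seed(3)
dt <- data.table(hour = 0:11,
                 minutes = runif(12, 0, 59),
                 count = rpois(12, 3))
setorder(dt, hour, minutes)

n_after <- 2L  # hypothetical parameter: rows to keep after each match

start <- dt[, which(hour %% 6 == 0)]                # positions of matching rows
idx <- rep(start, each = n_after + 1L) + 0:n_after  # each match plus its offsets
idx <- unique(idx[idx <= nrow(dt)])                 # guard the end and overlaps
res <- dt[idx]
```

With `n_after <- 2L` this keeps the rows for hours 0, 1, 2, 6, 7 and 8.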
