Filter window with dplyr: find matching row, and keep subsequent N rows
Problem description
I have a data frame and would like to keep the rows that match some condition, together with the N rows that follow each match. For example, consider a data frame that contains an hour and a minutes column (representing a timestamp per row). Let's say I would like two records starting at the 0th and at the 6th hour, i.e. the matching row plus the one after it. Is it possible to do this in a nice way?
    library(dplyr)

    set.seed(3)
    df <- data.frame(hour = 0:11,
                     minutes = runif(12, 0, 59),
                     count = rpois(12, 3)) %>%
      arrange(hour, minutes)
which produces
    > df
       hour   minutes count
    1     0  9.914450     3
    2     1 47.643468     3
    3     2 22.711599     5
    4     3 19.336325     5
    5     4 35.523940     1
    6     5 35.659249     4
    7     6  7.353373     5
    8     7 17.381455     2
    9     8 34.078985     2
    10    9 37.227777     0
    11   10 30.208938     1
    12   11 29.796411     1
The normal filter returns two rows:
    > df %>%
    +   filter(hour %% 6 == 0)
      hour  minutes count
    1    0 9.914450     3
    2    6 7.353373     5
However, the answer should be:
      hour   minutes count
    1    0  9.914450     3
    2    1 47.643468     3
    3    6  7.353373     5
    4    7 17.381455     2
In this case it is possible to use modulo arithmetic on the column used for filtering, but in the general case this may not be possible.
The original example is provided below; there I wanted the first two records within each even hour. In that case, Akrun's answer is good and exploits the group structure in the data. E.g.
    library(dplyr)

    set.seed(0)
    df <- data.frame(hour = rep(0:11, 3),
                     minutes = runif(36, 0, 59),
                     count = rpois(36, 3)) %>%
      arrange(hour, minutes)
looks like:
       hour    minutes count
    1     0  7.4077507     2
    2     0 10.4168484     3
    3     0 52.9051348     4
    4     1 15.6650111     4
    5     1 15.7660195     5
    6     1 40.5343480     4
    7     2 21.9553101     1
    8     2 22.6621194     4
    9     2 22.7807315     2
    10    3  0.7900297     3
    11    3 33.7983484     4
    12    3 45.4206438     3
    ...
One could do
    df %>%
      mutate(is_even_hour = ifelse(hour %% 2 == 0, 1, 0)) %>%
      filter(is_even_hour == 1) %>%
      group_by(hour, is_even_hour) %>%
      filter(row_number() <= 2) %>%
      ungroup() %>%
      select(-is_even_hour)
which gives
        hour   minutes count
       <int>     <dbl> <int>
     1     0  7.407751     2
     2     0 10.416848     3
     3     2 21.955310     1
     4     2 22.662119     4
     5     4 22.560889     2
     6     4 29.364255     5
     7     6 20.080591     2
     8     6 53.004991     3
     9     8 35.374384     4
    10     8 38.987070     3
    11    10  3.645390     4
    12    10 10.986838     5
Solution
After grouping by 'hour', we can do this in a single filter step
    df %>%
      group_by(hour) %>%
      filter(!hour %% 2 & row_number() < 3)
    #    hour   minutes count
    #   <int>     <dbl> <int>
    #1      0  7.407751     2
    #2      0 10.416848     3
    #3      2 21.955310     1
    #4      2 22.662119     4
    #5      4 22.560889     2
    #6      4 29.364255     5
    #7      6 20.080591     2
    #8      6 53.004991     3
    #9      8 35.374384     4
    #10     8 38.987070     3
    #11    10  3.645390     4
    #12    10 10.986838     5
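The condition `!hour %% 2` in that filter can look cryptic. It relies on two R rules: `%%` binds more tightly than `!`, so the expression is read as `!(hour %% 2)`, and `!` treats 0 as FALSE and any other number as TRUE. A small base-R check, using a made-up `hour` vector purely for illustration, suggests it is equivalent to the more explicit `hour %% 2 == 0`:

```r
# Illustrative vector (an assumption, not the data from the question)
hour <- 0:5

# %% has higher precedence than !, so this is !(hour %% 2):
# a remainder of 0 becomes TRUE, a remainder of 1 becomes FALSE
even_short <- !hour %% 2

# The explicit spelling of the same test
even_explicit <- hour %% 2 == 0

identical(even_short, even_explicit)
```

Either spelling should work inside `filter()`; the explicit comparison is arguably easier to read.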
For the updated post
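    # %>% binds more tightly than +, so the addition is wrapped in braces
    # to make sure 0:1 is added before match() is applied
    i1 <- df %>%
      filter(hour %% 6 == 0) %>%
      .$hour %>%
      { rep(., each = 2) + 0:1 } %>%
      match(., df$hour)
    df[i1, ]
    #  hour   minutes count
    #1    0  9.914450     3
    #2    1 47.643468     3
    #7    6  7.353373     5
    #8    7 17.381455     2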
Or this can be done in a compact way with data.table
    library(data.table)
    setDT(df)[df[, rep(which(!hour %% 6), each = 2) + 0:1]]
    #   hour   minutes count
    #1:    0  9.914450     3
    #2:    1 47.643468     3
    #3:    6  7.353373     5
    #4:    7 17.381455     2
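The question points out that modulo arithmetic on the filter column may not be possible in the general case. The row-position idea used above (`which()` on the condition, then adding `0:1`) can be extended to any logical condition and any window length. The following is a minimal base-R sketch, not part of the original answer: the helper name `window_rows` and its arguments `cond` and `n` are hypothetical, and the small data frame reuses only the `hour` and `count` values shown in the question.

```r
# Hypothetical helper: row indices of every matching row plus the n - 1 rows after it
window_rows <- function(cond, n) {
  starts <- which(cond)                          # positions of the matching rows
  idx <- rep(starts, each = n) + seq_len(n) - 1  # expand each start to start, start + 1, ...
  idx <- idx[idx <= length(cond)]                # drop indices that run past the last row
  unique(idx)                                    # rows in overlapping windows are kept once
}

# Example data: hour and count columns as printed in the question
df <- data.frame(hour = 0:11,
                 count = c(3, 3, 5, 5, 1, 4, 5, 2, 2, 0, 1, 1))

idx <- window_rows(df$hour %% 6 == 0, n = 2)
df[idx, ]  # with dplyr loaded, slice(df, idx) selects the same rows
```

Here `idx` works out to rows 1, 2, 7 and 8, the same rows as in the answers above. Because the helper works on row positions rather than on column values, it assumes the data frame is already sorted in the order in which "subsequent" rows should be counted.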