Grouping of R dataframe by connected values
Problem description
I didn't find a solution for this common grouping problem in R:
This is my original dataset:

    ID State
     1 A
     2 A
     3 B
     4 B
     5 B
     6 A
     7 A
     8 A
     9 C
    10 C
This should be my grouped resulting dataset:

    State min(ID) max(ID)
    A     1       2
    B     3       5
    A     6       8
    C     9       10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
Any ideas?
Solution

You could try:

    library(dplyr)

    df %>%
      # start a new run id whenever State differs from the previous row
      mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
      group_by(rleid) %>%
      summarise(State = first(State), min = min(ID), max = max(ID)) %>%
      select(-rleid)
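To see what the rleid column holds, the compare-with-previous-row and cumsum() step can be sketched in base R. This is only an illustration: the df below is a re-creation of the question's data with State stored as character, and starts_run and run_id are names made up for this example.

```r
# Re-creation of the question's data (assumption: State is character, rows sorted by ID).
df <- data.frame(
  ID = 1:10,
  State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# TRUE wherever State differs from the row before it; the first row always starts a run.
starts_run <- c(TRUE, df$State[-1] != df$State[-nrow(df)])

# Counting the run starts cumulatively gives every row the id of its run.
run_id <- cumsum(starts_run)
print(run_id)
```

Rows that share a run_id form one block of connected states, which is what group_by(rleid) then summarises.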
Or, as mentioned by @alistaire in the comments, you can actually mutate within group_by() using the same syntax, which combines the first two steps. Borrowing data.table::rleid() and using summarise_all() simplifies it further:

    df %>%
      group_by(State, rleid = data.table::rleid(State)) %>%
      summarise_all(funs(min, max)) %>%
      select(-rleid)
Which gives:

    ## A tibble: 4 × 3
    #   State   min   max
    #  <fctr> <int> <int>
    #1      A     1     2
    #2      B     3     5
    #3      A     6     8
    #4      C     9    10
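On recent dplyr versions the same idea can probably be written without data.table or funs(). funs() was deprecated in dplyr 0.8.0, and dplyr 1.1.0 added consecutive_id(). The following is a sketch under those assumptions, not the original answer's code; the column name run and the re-created df (State as character) are chosen for this example.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for consecutive_id()

# Re-creation of the question's data (assumption: State is character, rows sorted by ID).
df <- data.frame(
  ID = 1:10,
  State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # run increases by one each time State changes from the previous row
  group_by(run = consecutive_id(State), State) %>%
  summarise(min = min(ID), max = max(ID), .groups = "drop") %>%
  select(-run)

print(result)
```

Putting the run id first in group_by() keeps the summarised rows in run order (A, B, A, C), because summarise() sorts its output by the grouping keys.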