Grouping of R dataframe by connected values


Problem description

I didn't find a solution for this common grouping problem in R:

This is my original dataset

ID  State
1   A
2   A
3   B
4   B
5   B
6   A
7   A
8   A
9   C
10  C

This should be my grouped resulting dataset

State   min(ID) max(ID)
A       1       2
B       3       5
A       6       8
C       9       10

So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.

Any ideas?

Solution

You could try:

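library(dplyr)

df %>%
  # Start a new run id whenever State differs from the previous row.
  # default = first(State) makes the first comparison FALSE and works
  # whether State is a factor or a character column.
  mutate(rleid = cumsum(State != lag(State, default = first(State)))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)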
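
To see why the cumsum() step produces a usable grouping key, here is a minimal base R sketch on the question's data. The data frame is rebuilt from the question (storing State as character is an assumption), and the names changed and run_id are only illustrative.

```r
# Rebuild the example data from the question (State stored as character here).
df <- data.frame(
  ID = 1:10,
  State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# TRUE wherever State differs from the previous row; the first row opens run 1.
changed <- c(TRUE, df$State[-1] != df$State[-nrow(df)])

# The running count of changes gives every consecutive run its own id.
run_id <- cumsum(changed)
print(run_id)
# [1] 1 1 2 2 2 3 3 3 4 4
```

The two separate A runs get ids 1 and 3, which is why grouping by this key keeps them apart while grouping by State alone would merge them.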


Or, as @alistaire mentioned in the comments, you can actually mutate within group_by() with the same syntax, which combines the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:

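df %>%
  # Create the run id inside group_by(); grouping by State as well keeps it in the output.
  group_by(State, rleid = data.table::rleid(State)) %>%
  # funs() is deprecated in current dplyr; a named list does the same job.
  summarise_all(list(min = min, max = max)) %>%
  select(-rleid)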

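The first version gives the following (the second returns the same rows, but ordered by State, since State is the first grouping key):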

## A tibble: 4 × 3
#   State   min   max
#  <fctr> <int> <int>
#1      A     1     2
#2      B     3     5
#3      A     6     8
#4      C     9    10
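
Since the question mentions rle(), here is a hedged base R sketch showing that its run lengths can still be turned into min and max IDs by converting them into start and end row positions. This is an alternative to the dplyr answer above, not part of it; it assumes the data is sorted by ID, so the first and last row of each run hold that run's minimum and maximum. The names runs, start, end and result are illustrative.

```r
# Example data from the question.
df <- data.frame(
  ID = 1:10,
  State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# Sort by ID so each run's first/last row carries its min/max ID.
df <- df[order(df$ID), ]

# rle() needs a plain vector, so convert in case State is a factor.
runs <- rle(as.character(df$State))

# Turn run lengths into the row positions where each run ends and starts.
end <- cumsum(runs$lengths)
start <- end - runs$lengths + 1

result <- data.frame(
  State = runs$values,
  min = df$ID[start],
  max = df$ID[end],
  stringsAsFactors = FALSE
)
print(result)
```

On the example data this yields the same four rows as the output above, in the original A, B, A, C order.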
