Create index column based on existent groups of rows (with duplicates)

Views: 135 Published: 2018/8/2 13:34:04 Tags: r indexing grouping add col

Problem description

Here is my example data.frame:

df = read.table(text = 'ID  Day Count   Count_group 
77661   14498   4   5
76552   14498   4   5
37008   14498   4   5
34008   14498   4   5
30004   14497   1   5
30004   14497   1   4   
28047   14496   3   4   
28049   14496   3   4   
29003   14496   3   4   
69012   14468   1   4   
69007   14467   3   4   
69012   14467   3   4   
69020   14467   3   4   
42003   13896   2   4   
42011   13896   2   4   
22001   13895   2   4   
23007   13895   2   4   
28047   14496   3   3   
28049   14496   3   3   
29003   14496   3   3   
69007   14467   3   3   
69012   14467   3   3   
69020   14467   3   3   
48005   14271   2   2   
48007   14271   2   2   
22001   13895   2   2   
23007   13895   2   2   
47011   14320   1   2   
73005   14319   1   2   
73005   14319   1   1', header = TRUE)

The Count column shows the number of ID values (rows) per Day.
The Count_group column shows the sum of the unique Count values grouped by Day and Day - 1.
I need to create an index column that groups the rows of each Count_group by Day and Day - 1, following the descending order of df (with duplicates!).
This is my expected output:
ID     Day  Count Count_group index_col
77661   14498   4   5           1
76552   14498   4   5           1
37008   14498   4   5           1
34008   14498   4   5           1
30004   14497   1   5           1
30004   14497   1   4           2
28047   14496   3   4           2
28049   14496   3   4           2
29003   14496   3   4           2
69012   14468   1   4           3
69007   14467   3   4           3
69012   14467   3   4           3
69020   14467   3   4           3
42003   13896   2   4           4
42011   13896   2   4           4
22001   13895   2   4           4
23007   13895   2   4           4
28047   14496   3   3           5
28049   14496   3   3           5
29003   14496   3   3           5
69007   14467   3   3           6
69012   14467   3   3           6
69020   14467   3   3           6
48005   14271   2   2           7
48007   14271   2   2           7
22001   13895   2   2           8
23007   13895   2   2           8
47011   14320   1   2           9
73005   14319   1   2           9
73005   14319   1   1          10

And do the same, but with index_col grouping by 3 days: Day, Day - 1 and Day - 2:
    df_2 = read.table(text = 'ID Day Count Count_group
30004   14497   1   5
28047   14496   3   5
28049   14496   3   5
29003   14496   3   5
69012   14495   1   5
69007   14467   3   5
69012   14467   3   5
69020   14467   3   5
42003   14466   1   5
42011   14465   1   5
28047   14496   3   4
28049   14496   3   4
29003   14496   3   4
69012   14995   1   4
22001   13895   2   4
23007   13895   2   4
28047   13894   2   4
28049   13894   2   4
42003   14466   1   2
42011   14465   1   2
28047   13894   2   2
28049   13894   2   2
69012   14995   1   1
42011   14465   1   1', header = TRUE)

Expected output:
ID     Day  Count Count_group index_col
30004   14497   1   5           1
28047   14496   3   5           1
28049   14496   3   5           1
29003   14496   3   5           1
69012   14495   1   5           1
69007   14467   3   5           2
69012   14467   3   5           2
69020   14467   3   5           2
42003   14466   1   5           2
42011   14465   1   5           2
28047   14496   3   4           3
28049   14496   3   4           3
29003   14496   3   4           3
69012   14995   1   4           3
22001   13895   2   4           4
23007   13895   2   4           4
28047   13894   2   4           4
28049   13894   2   4           4
42003   14466   1   2           5
42011   14465   1   2           5
28047   13894   2   2           6
28049   13894   2   2           6
69012   14995   1   1           7
42011   14465   1   1           8

Do you have any suggestions?
I would like to write generic code that can be applied (with a few adjustments) to df, df_2 and other data.frames whose grouping variable spans n days.
Recommended answer
Using dplyr:
library(dplyr)
df %>% mutate(index_col = cumsum(!c(+Inf, diff(Day)) %in% c(0, -1)))

Explanation:
 c(+Inf,diff(Day))

As you want two consecutive days, I compute the difference between successive Day values with diff(Day). Since diff returns a vector of length n-1, I have to prepend a value at the top of the vector; I choose +Inf.
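
A minimal sketch of this first step in base R. The Day values below are a made-up short vector in the style of df, not the full data:

```r
# Illustrative Day values (assumed, not the full df from the question)
day <- c(14498, 14498, 14497, 14468, 14467)

# diff() gives the step between each row and the previous one (length n - 1);
# prepending Inf restores length n and guarantees the first row never
# matches an allowed step.
steps <- c(Inf, diff(day))
print(steps)
```

The first element is Inf, and every other element is the change in Day relative to the row above it.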
!(... %in% c(0,-1))

I test whether each difference corresponds to the same Day (0) or to Day - 1 (-1), since those rows must be grouped together. The negation keeps the rows where this is not the case, i.e. the rows where a new group starts.
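
As a small illustration, assuming a hand-written vector of steps rather than the real data, the negated membership test marks where a new group begins:

```r
# Assumed example steps: Inf for the first row, then Day differences
steps <- c(Inf, 0, -1, -29, -1)

# TRUE where the step is neither 0 (same Day) nor -1 (previous Day),
# i.e. where a new group starts. %in% binds tighter than !, so the
# parentheses are optional but make the intent explicit.
new_group <- !(steps %in% c(0, -1))
print(new_group)
```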
cumsum(...)

Finally, I use cumsum to count how many changes have occurred up to each row, which gives the group index.
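
A short sketch of this last step, again on an assumed logical vector: each TRUE counts as 1, so the running total increases by one at every group start.

```r
# Assumed flags: TRUE marks the first row of each group
new_group <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# cumsum() treats TRUE as 1 and FALSE as 0, so the running total
# is the index of the group each row belongs to.
index_col <- cumsum(new_group)
print(index_col)
```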
Output:
This works for both of your examples:
> df %>% mutate(index_col = cumsum(!c(+Inf,diff(Day))%in%c(0,-1)))

      ID   Day Count Count_group index_col
1  30004 14497     1           4         1
2  28047 14496     3           4         1
3  28049 14496     3           4         1
4  29003 14496     3           4         1
5  69012 14468     1           4         2
6  69007 14467     3           4         2
7  69012 14467     3           4         2
8  69020 14467     3           4         2
9  42003 13896     2           4         3
10 42011 13896     2           4         3
11 22001 13895     2           4         3
12 23007 13895     2           4         3
13 28047 14496     3           3         4
14 28049 14496     3           3         4
15 29003 14496     3           3         4
16 69007 14467     3           3         5
17 69012 14467     3           3         5
18 69020 14467     3           3         5
19 48005 14271     2           2         6
20 48007 14271     2           2         6
21 22001 13895     2           2         7
22 23007 13895     2           2         7
23 47011 14320     1           2         8
24 73005 14319     1           2         8
25 73005 14319     1           1         8

and
> df_2 %>% mutate(index_col = cumsum(!c(+Inf,diff(Day))%in%c(0,-1)))

      ID   Day Count Count_group index_col
1  30004 14497     1           5         1
2  28047 14496     3           5         1
3  28049 14496     3           5         1
4  29003 14496     3           5         1
5  69012 14495     1           5         1
6  69007 14467     3           5         2
7  69012 14467     3           5         2
8  69020 14467     3           5         2
9  42003 14466     1           5         2
10 42011 14465     1           5         2
11 28047 14496     3           4         3
12 28049 14496     3           4         3
13 29003 14496     3           4         3
14 69012 14495     1           4         3
15 22001 13895     2           4         4
16 23007 13895     2           4         4
17 28047 13894     2           4         4
18 28049 13894     2           4         4
19 42003 14466     1           2         5
20 42011 14465     1           2         5
21 28047 13894     2           2         6
22 28049 13894     2           2         6
23 69012 14995     1           1         7
24 42011 14465     1           1         8
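
The expected output in the question also starts a new index whenever Count_group changes while Day stays the same (for example between the two 30004 rows on Day 14497). The sketch below is one possible variant, not part of the original answer. It assumes that a change in Count_group should always open a new group, uses base R instead of dplyr, and introduces a hypothetical helper make_index with a hypothetical parameter allowed_steps.

```r
# Hypothetical helper (not from the original answer): a new group starts when
# the Day step is outside `allowed_steps` OR when Count_group changes.
make_index <- function(day, count_group, allowed_steps = c(0, -1)) {
  day_break   <- !(c(Inf, diff(day)) %in% allowed_steps)
  group_break <- c(TRUE, diff(count_group) != 0)
  cumsum(day_break | group_break)
}

# Day and Count_group columns of df from the question, written compactly
day <- c(rep(14498, 4), 14497, 14497, rep(14496, 3), 14468, rep(14467, 3),
         rep(13896, 2), rep(13895, 2), rep(14496, 3), rep(14467, 3),
         rep(14271, 2), rep(13895, 2), 14320, 14319, 14319)
count_group <- c(rep(5, 5), rep(4, 12), rep(3, 6), rep(2, 6), 1)

index_col <- make_index(day, count_group)
print(index_col)
```

With a data.frame, the same helper could be called as df$index_col <- make_index(df$Day, df$Count_group). Because the rule only looks at the step between adjacent rows, it does not limit how many consecutive days end up in one group.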

