Grouped non-dense rank without omitted values


Problem description

I have the following data.frame:

df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id   = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))

And I want to add a new column grp which, for each date, ranks the IDs. Ties should have the same value, but there should be no omitted values. That is, if there are two values which are equally minimum, they should both get rank 1, and the next lowest values should get rank 2.

The expected result would therefore look like this. Note that, as mentioned, the groups are for each date, so the operation must be grouped by date.

data.frame(date = c(1, 1, 1, 1,     2, 2, 2, 2,     3, 3, 3, 3),
           id   = c(4, 4, 2, 4,     1, 2, 3, 1,     2, 2, 1, 1),
           grp  = c(2, 2, 1, 2,     1, 2, 3, 1,     2, 2, 1, 1))

I'm sure there's a trivial way to do this but I haven't found it: none of the options for ties.method behave in this way (data.table::frank also doesn't help, since it only adds a dense rank).
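
To make the gap problem concrete, here is a small base R sketch. The vector x is a made-up example, not data from this question, and the match() idiom is only one possible way to express "same rank for ties, no skipped ranks".

```r
# Hypothetical example vector (not taken from the question's data).
x <- c(1, 1, 2, 3)

# Base R rank() either skips ranks after a tie or breaks the tie.
rank(x, ties.method = "min")      # 1 1 3 4      (rank 2 is skipped)
rank(x, ties.method = "max")      # 2 2 3 4      (rank 1 is skipped)
rank(x, ties.method = "average")  # 1.5 1.5 3 4  (fractional ranks)
rank(x, ties.method = "first")    # 1 2 3 4      (ties are broken)

# Desired behaviour: position of each value among the sorted distinct values.
dense <- match(x, sort(unique(x)))
dense                             # 1 1 2 3
```

The last line gives tied values the same rank and numbers the distinct values consecutively, which is the behaviour described above.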

I thought of doing a normal rank and then using data.table::rleid, but that doesn't work if there are duplicate values separated by other values during the same day.

I also thought of grouping by date and id and then using a group-ID, but the lowest values each day must start at rank 1, so that won't work either.

The only functional solution I've found is to create another table with the unique ids per day and then join that table to this one:

suppressPackageStartupMessages(library(dplyr))

df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id   = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))

uniques <- df %>%
  group_by(
    date
  ) %>%
  distinct(
    id
  ) %>%
  mutate(
    grp = rank(id)
  )

df <- df %>% left_join(
  uniques
) %>% print()
#> Joining, by = c("date", "id")
#>    date id grp
#> 1     1  4   2
#> 2     1  4   2
#> 3     1  2   1
#> 4     1  4   2
#> 5     2  1   1
#> 6     2  2   2
#> 7     2  3   3
#> 8     2  1   1
#> 9     3  2   2
#> 10    3  2   2
#> 11    3  1   1
#> 12    3  1   1

Created on 2020-05-08 by the reprex package (v0.3.0)

However, this seems quite inelegant and convoluted for what seems like a simple operation, so I'd rather see if other solutions are available.

Curious to see data.table solutions if available, but unfortunately the solution must be in dplyr.

Solution

We can use dense_rank() from dplyr, which gives ties the same rank and leaves no gaps between ranks:

library(dplyr)
df %>%
   group_by(date) %>%
   mutate(grp = dense_rank(id))
# A tibble: 12 x 3
# Groups:   date [3]
#   date    id   grp
#   <dbl> <dbl> <int>
# 1     1     4     2
# 2     1     4     2
# 3     1     2     1
# 4     1     4     2
# 5     2     1     1
# 6     2     2     2
# 7     2     3     3
# 8     2     1     1
# 9     3     2     2
#10     3     2     2
#11     3     1     1
#12     3     1     1
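
For comparison, the same grouped result can be sketched in base R without any packages. The helper name dense_rank_base is hypothetical (it is not part of dplyr or base R); it is only meant to show what a dense rank computes within each date.

```r
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id   = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))

# Hypothetical helper: dense rank as the position of each value
# among the sorted distinct values of the vector.
dense_rank_base <- function(x) match(x, sort(unique(x)))

# ave() applies the helper within each date group and keeps the row order.
df$grp <- ave(df$id, df$date, FUN = dense_rank_base)
df$grp
# 2 2 1 2 1 2 3 1 2 2 1 1
```

This matches the grp column in the expected result from the question.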


Or with frank() from data.table, using ties.method = 'dense':

library(data.table)
setDT(df)[, grp := frank(id, ties.method = 'dense'), date]
