Top 5 and bottom 5 in R using group_by


Question

I am looking for code or a function that assigns a label to the 5 highest and the 5 lowest values. For example, the data might look like this:

df <- data.frame(
   Date = c(rep("2010-01-31",16), rep("2010-02-28", 14)), 
   Value=c(rep(c(1,2,3,4,5,6,7,8,9,NA,NA,NA,NA,NA,15),2))
)

Edit: this is only sample data. My real data is more complex, so the code should handle a different number of rows per Date and multiple missing values (NA).

I would like the five lowest values labeled "5w" and the five highest labeled "5b". The logic should be wrapped in a group_by on the date so that the process is repeated for each period. I have tried using percentiles, but that method does not keep a constant number of values in each bracket, so I am looking for a method that does.

If possible, it would be nice to put all firms into 5% brackets, that is, 20 brackets with all firms distributed among them. The best bracket would consist of the 5% of firms with the highest values. The brackets could be numbered 0:19, so that a firm in the highest bracket gets 19 and a firm in the lowest bracket gets 0.

Thanks in advance.

Recommended answer

Heads up: while I suspect this is just sample data, there are two 1s in 2010-01-31. The code below accounts for that, but the output looks odd when unsorted, so I'm adding arrange to make the tie visible.

I use min_rank here, assuming that you do not want ties to change the count and always want exactly five rows at the top and at the bottom. An alternative is dense_rank, which would label six rows from 2010-01-31 as "5w" because of the tie for 1.

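library(dplyr)

# dat is the data frame to label (the question calls it df)
dat %>%
  group_by(Date) %>%                  # rank within each date
  mutate(
    R = min_rank(Value),              # tied values share the lowest rank
    Quux = case_when(
      R < 6       ~ "5w",             # five lowest ranks
      R > n() - 5 ~ "5b",             # five highest ranks
      TRUE        ~ NA_character_)    # everything in between
    ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%            # sort so the tied values sit together
  print(n=99)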
# # A tibble: 30 x 4
#    Date       Value     R Quux 
#    <fct>      <int> <int> <chr>
#  1 2010-01-31     1     1 5w   
#  2 2010-01-31     1     1 5w   
#  3 2010-01-31     2     3 5w   
#  4 2010-01-31     3     4 5w   
#  5 2010-01-31     4     5 5w   
#  6 2010-01-31     5     6 <NA> 
#  7 2010-01-31     6     7 <NA> 
#  8 2010-01-31     7     8 <NA> 
#  9 2010-01-31     8     9 <NA> 
# 10 2010-01-31     9    10 <NA> 
# 11 2010-01-31    10    11 <NA> 
# 12 2010-01-31    11    12 5b   
# 13 2010-01-31    12    13 5b   
# 14 2010-01-31    13    14 5b   
# 15 2010-01-31    14    15 5b   
# 16 2010-01-31    15    16 5b   
# 17 2010-02-28     2     1 5w   
# 18 2010-02-28     3     2 5w   
# 19 2010-02-28     4     3 5w   
# 20 2010-02-28     5     4 5w   
# 21 2010-02-28     6     5 5w   
# 22 2010-02-28     7     6 <NA> 
# 23 2010-02-28     8     7 <NA> 
# 24 2010-02-28     9     8 <NA> 
# 25 2010-02-28    10     9 <NA> 
# 26 2010-02-28    11    10 5b   
# 27 2010-02-28    12    11 5b   
# 28 2010-02-28    13    12 5b   
# 29 2010-02-28    14    13 5b   
# 30 2010-02-28    15    14 5b   
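
To see how the two ranking functions treat the tie, here is a small sketch. The vector x is made up for illustration: it holds the six smallest values of 2010-01-31 from the output above.

```r
library(dplyr)

# Illustrative vector: the six smallest values of 2010-01-31, including the tied 1s
x <- c(1, 1, 2, 3, 4, 5)

min_rank(x)    # 1 1 3 4 5 6: ties share the lowest rank, the next rank is skipped
dense_rank(x)  # 1 1 2 3 4 5: ties share a rank, no rank is skipped

# Number of rows that a cutoff of "rank < 6" would label "5w"
sum(min_rank(x) < 6)    # 5
sum(dense_rank(x) < 6)  # 6
```

With min_rank the cutoff keeps exactly five rows here, while dense_rank lets the tie push a sixth row under the cutoff.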

Edit, using the updated data from the question. I'm inferring that NA values should be ignored and that only rows with a rank should be considered. min_rank returns NA for an NA value, so the upper cutoff is now taken from the highest rank actually assigned instead of from n(). This also shows what happens when a date has fewer than 10 non-NA rows: 2010-02-28 has only nine, so only four of them are labeled 5b.

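dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),                      # NA values get an NA rank
    Quux = case_when(
      R < 6                        ~ "5w",    # checked first, so it wins any overlap
      R > max(R, na.rm = TRUE) - 5 ~ "5b",    # cutoff from the highest non-NA rank
      TRUE                         ~ NA_character_)
    ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n=99)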

# # A tibble: 30 x 4
#    Date       Value     R Quux 
#    <fct>      <dbl> <int> <chr>
#  1 2010-01-31     1     1 5w   
#  2 2010-01-31     1     1 5w   
#  3 2010-01-31     2     3 5w   
#  4 2010-01-31     3     4 5w   
#  5 2010-01-31     4     5 5w   
#  6 2010-01-31     5     6 <NA> 
#  7 2010-01-31     6     7 5b   
#  8 2010-01-31     7     8 5b   
#  9 2010-01-31     8     9 5b   
# 10 2010-01-31     9    10 5b   
# 11 2010-01-31    15    11 5b   
# 12 2010-01-31    NA    NA <NA> 
# 13 2010-01-31    NA    NA <NA> 
# 14 2010-01-31    NA    NA <NA> 
# 15 2010-01-31    NA    NA <NA> 
# 16 2010-01-31    NA    NA <NA> 
# 17 2010-02-28     2     1 5w   
# 18 2010-02-28     3     2 5w   
# 19 2010-02-28     4     3 5w   
# 20 2010-02-28     5     4 5w   
# 21 2010-02-28     6     5 5w   
# 22 2010-02-28     7     6 5b   
# 23 2010-02-28     8     7 5b   
# 24 2010-02-28     9     8 5b   
# 25 2010-02-28    15     9 5b   
# 26 2010-02-28    NA    NA <NA> 
# 27 2010-02-28    NA    NA <NA> 
# 28 2010-02-28    NA    NA <NA> 
# 29 2010-02-28    NA    NA <NA> 
# 30 2010-02-28    NA    NA <NA> 
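
The code above does not cover the second part of the question, the 20 brackets numbered 0:19. One possible approach, offered only as a sketch and not as part of the original answer, is dplyr's ntile(), which splits the ranked values of each group into a requested number of roughly equal-sized buckets. The firms data frame and the Bracket column below are hypothetical names, and the data is made up so that each date has exactly 40 non-NA values.

```r
library(dplyr)

# Hypothetical data: 40 distinct values and 3 missing values per date
firms <- data.frame(
  Date  = rep(c("2010-01-31", "2010-02-28"), each = 43),
  Value = rep(c(seq(10, 400, by = 10), NA, NA, NA), 2)
)

brackets <- firms %>%
  group_by(Date) %>%
  mutate(Bracket = ntile(Value, 20) - 1L) %>%  # ntile() counts from 1, so shift to 0:19
  ungroup()

# Firms per bracket and date; rows with an NA Value keep an NA Bracket
count(brackets, Date, Bracket)
```

In this example every bracket holds two firms per date. When the number of non-NA values is not a multiple of 20 the bucket sizes cannot all be equal, and ntile() breaks ties by row order, so equal values can land in different brackets. How a date with fewer than 20 non-NA values is split is worth checking against the dplyr version in use.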
