Top 5 and bottom 5 in R using group_by
Question
I am looking for code or a function that assigns a label to the 5 highest and the 5 lowest values. The data could, for example, look like this:
df <- data.frame(
  Date = c(rep("2010-01-31", 16), rep("2010-02-28", 14)),
  Value = c(rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA, NA, NA, NA, NA, 15), 2))
)
Edit: This is just sample data. The data I use is more complex, so the code should allow for a varying number of rows per Date and for multiple missing values (NAs).
I would then like the five lowest values to be labeled "5w" and the five highest values "5b". The data should be wrapped in a group_by on the date so that the process is repeated for each period. I have tried using percentiles, but that method does not keep a constant number of values in each bracket, so I am looking for a method that does. If possible, it would be nice to put all firms into 5% brackets, that is, 20 brackets with all firms distributed among them. The best bracket would then consist of the 5% of firms with the highest values. The bracket values could be 0:19, i.e. a firm in the highest bracket would get 19 and a firm in the lowest bracket would get 0.
Thanks in advance.
Recommended answer
Heads up: while I suspect that this is just sample data, you have two 1s in 2010-01-31. The code accounts for that, but the output looks odd when unsorted, so I'm adding arrange to sort it for display.
I use min_rank here, assuming that you do not want ties to enlarge the groups and always want the top/bottom 5. An alternative is dense_rank, which would label the first six rows of 2010-01-31 as 5w because of the tie for 1.
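library(dplyr)

# dat holds the question's data (the data frame called df there)
dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),           # rank within each Date; ties share the lowest rank
    Quux = case_when(
      R < 6       ~ "5w",          # five lowest ranks
      R > n() - 5 ~ "5b",          # five highest ranks
      TRUE        ~ NA_character_  # everything in between
    )
  ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n = 99)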
# # A tibble: 30 x 4
# Date Value R Quux
# <fct> <int> <int> <chr>
# 1 2010-01-31 1 1 5w
# 2 2010-01-31 1 1 5w
# 3 2010-01-31 2 3 5w
# 4 2010-01-31 3 4 5w
# 5 2010-01-31 4 5 5w
# 6 2010-01-31 5 6 <NA>
# 7 2010-01-31 6 7 <NA>
# 8 2010-01-31 7 8 <NA>
# 9 2010-01-31 8 9 <NA>
# 10 2010-01-31 9 10 <NA>
# 11 2010-01-31 10 11 <NA>
# 12 2010-01-31 11 12 5b
# 13 2010-01-31 12 13 5b
# 14 2010-01-31 13 14 5b
# 15 2010-01-31 14 15 5b
# 16 2010-01-31 15 16 5b
# 17 2010-02-28 2 1 5w
# 18 2010-02-28 3 2 5w
# 19 2010-02-28 4 3 5w
# 20 2010-02-28 5 4 5w
# 21 2010-02-28 6 5 5w
# 22 2010-02-28 7 6 <NA>
# 23 2010-02-28 8 7 <NA>
# 24 2010-02-28 9 8 <NA>
# 25 2010-02-28 10 9 <NA>
# 26 2010-02-28 11 10 5b
# 27 2010-02-28 12 11 5b
# 28 2010-02-28 13 12 5b
# 29 2010-02-28 14 13 5b
# 30 2010-02-28 15 14 5b
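The difference between min_rank and dense_rank mentioned above can be checked on the six smallest values of 2010-01-31. This is a small standalone sketch, not part of the answer's pipeline; the names x, min_r and dense_r are only illustrative.

```r
library(dplyr)

# The six smallest values of 2010-01-31, including the tied 1s
x <- c(1, 1, 2, 3, 4, 5)

min_r   <- min_rank(x)    # ties share the lowest rank; the following rank is skipped
dense_r <- dense_rank(x)  # ties share a rank; no ranks are skipped

sum(min_r < 6)    # number of rows the condition R < 6 would label "5w" with min_rank
sum(dense_r < 6)  # number of rows the same condition would label with dense_rank
```

With min_rank the value 5 receives rank 6 and falls outside the R < 6 condition, whereas dense_rank gives it rank 5 and so lets a sixth row through.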
Edit, using the newly discovered data: I'm inferring that the NA values should be ignored and that only the ranked values should be considered. This also shows what happens when a group has fewer than 10 non-NA rows: 2010-02-28 gets only four 5b labels.
dat %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),                         # NA values get rank NA
    Quux = case_when(
      R < 6                        ~ "5w",
      R > max(R, na.rm = TRUE) - 5 ~ "5b",       # count down from the highest non-NA rank
      TRUE                         ~ NA_character_
    )
  ) %>%
  ungroup() %>%
  arrange(Date, Value) %>%
  print(n = 99)
# # A tibble: 30 x 4
# Date Value R Quux
# <fct> <dbl> <int> <chr>
# 1 2010-01-31 1 1 5w
# 2 2010-01-31 1 1 5w
# 3 2010-01-31 2 3 5w
# 4 2010-01-31 3 4 5w
# 5 2010-01-31 4 5 5w
# 6 2010-01-31 5 6 <NA>
# 7 2010-01-31 6 7 5b
# 8 2010-01-31 7 8 5b
# 9 2010-01-31 8 9 5b
# 10 2010-01-31 9 10 5b
# 11 2010-01-31 15 11 5b
# 12 2010-01-31 NA NA <NA>
# 13 2010-01-31 NA NA <NA>
# 14 2010-01-31 NA NA <NA>
# 15 2010-01-31 NA NA <NA>
# 16 2010-01-31 NA NA <NA>
# 17 2010-02-28 2 1 5w
# 18 2010-02-28 3 2 5w
# 19 2010-02-28 4 3 5w
# 20 2010-02-28 5 4 5w
# 21 2010-02-28 6 5 5w
# 22 2010-02-28 7 6 5b
# 23 2010-02-28 8 7 5b
# 24 2010-02-28 9 8 5b
# 25 2010-02-28 15 9 5b
# 26 2010-02-28 NA NA <NA>
# 27 2010-02-28 NA NA <NA>
# 28 2010-02-28 NA NA <NA>
# 29 2010-02-28 NA NA <NA>
# 30 2010-02-28 NA NA <NA>
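The question also asks for 20 equal-sized brackets numbered 0:19, which the code above does not cover. One possible sketch uses dplyr's ntile, which splits the ranked values of each group into a requested number of buckets. The firms data frame, the Bracket column and the 40-rows-per-date layout are made up for illustration and are not from the question.

```r
library(dplyr)

# Hypothetical data: 40 firms per date, so each of the 20 brackets holds 2 firms
firms <- data.frame(
  Date  = rep(c("2010-01-31", "2010-02-28"), each = 40),
  Value = c(1:40, 40:1)
)

bracketed <- firms %>%
  group_by(Date) %>%
  mutate(Bracket = ntile(Value, 20) - 1L) %>%  # ntile returns 1:20; shift to 0:19
  ungroup()

# Number of firms per bracket within each date
table(bracketed$Date, bracketed$Bracket)
```

A few caveats apply. ntile leaves NA values unbracketed, and it breaks ties by row order, so equal values can end up in neighboring brackets. When the number of non-missing rows in a group is not a multiple of 20, bracket sizes differ by one, and a group with fewer than 20 such rows (as in the question's sample data) will not reach bracket 19.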