Relative frequencies / proportions with dplyr

Problem description

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?

library(dplyr)
data(mtcars)
mtcars <- as_tibble(mtcars)  # tbl_df() is deprecated in dplyr >= 1.0.0

# count frequency
mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())

# am gear  n
#  0    3 15 
#  0    4  4 
#  1    4  8  
#  1    5  5 

What I would like to achieve:

am gear  n  rel.freq
 0    3 15 0.7894737
 0    4  4 0.2105263
 1    4  8 0.6153846
 1    5  5 0.3846154

Solution

Try this:

mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

#   am gear  n      freq
# 1  0    3 15 0.7894737
# 2  0    4  4 0.2105263
# 3  1    4  8 0.6153846
# 4  1    5  5 0.3846154

From the dplyr vignette:

"When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset."

Thus, after the summarise, the grouping variable 'gear' is peeled off, and the data is then grouped only by 'am' (you can check this by calling groups() on the result), on which we then perform the mutate calculation.
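
One way to confirm the peeling is to inspect the grouping variables of the summarised result. This is only a minimal sketch, assuming a dplyr version that provides group_vars(); the names counts and totals are just for illustration:

```r
library(dplyr)

# Summarise by am and gear; summarise() drops the last grouping level (gear)
counts <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())

# Only "am" should remain as a grouping variable
group_vars(counts)

# Because the result is still grouped by am, sum(n) is computed per am group,
# which is exactly the denominator used in the mutate() step
totals <- counts %>%
  summarise(total = sum(n))
totals
```

The per-group totals are the denominators behind the freq column, which is why n / sum(n) gives proportions within each value of am rather than across the whole table.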

The outcome of the 'peeling' of course depends on the order of the grouping variables in the group_by call. We were lucky this time that it peeled off the desired variable. You may wish to do a subsequent group_by(am) to make your code more explicit.
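
A sketch of the more explicit variant, so the result no longer depends on the order of variables in the first group_by call. The count() version is an alternative added here for comparison, not part of the original answer; explicit and counted are illustrative names:

```r
library(dplyr)

# Option 1: regroup explicitly after summarising
explicit <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  group_by(am) %>%
  mutate(freq = n / sum(n))

# Option 2: count() on ungrouped data returns ungrouped counts,
# so the grouping used for the proportions is stated exactly once
counted <- mtcars %>%
  count(am, gear) %>%
  group_by(am) %>%
  mutate(freq = n / sum(n))

explicit
counted
```

Both pipelines should yield the same freq column as the shorter version above; the difference is only that the denominator's grouping is spelled out.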

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.
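
As a rough illustration of rounding (not necessarily the approach taken in the answer referred to above), base R's round() and sprintf() can be applied inside the same mutate call. The column name pct and the object name pretty are assumptions made for this sketch:

```r
library(dplyr)

pretty <- mtcars %>%
  count(am, gear) %>%
  group_by(am) %>%
  mutate(
    freq = round(n / sum(n), 3),                # proportion rounded to 3 decimals
    pct  = sprintf("%.1f%%", 100 * n / sum(n))  # percentage as a formatted string
  )

pretty
```

Keeping the numeric freq alongside the character pct column lets later calculations use the number while reports use the formatted string.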
