Using dplyr window functions to calculate percentiles
I have a working solution but am looking for a cleaner, more readable one, perhaps taking advantage of some of the newer dplyr window functions.
Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:
library(dplyr)
library(tidyr)
# load data
data("mtcars")
# Percentiles used in calculation
p <- c(.25, .5, .75)
# old dplyr solution
mtcars %>% group_by(cyl) %>%
  do(data.frame(p = p, stats = quantile(.$mpg, probs = p),
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)
# note: the select and spread statements are just to get the data into
# the format in which I'd like to see it, but are not critical
Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By cleanly, I mean without the do() statement.
Thank you
Here's a dplyr approach that avoids do() but requires a separate call to quantile() for each quantile value:
mtcars %>% group_by(cyl) %>%
  summarise(`25%` = quantile(mpg, probs = 0.25),
            `50%` = quantile(mpg, probs = 0.5),
            `75%` = quantile(mpg, probs = 0.75),
            avg = mean(mpg),
            n = n())
  cyl   25%  50%   75%      avg  n
1   4 22.80 26.0 30.40 26.66364 11
2   6 18.65 19.7 21.00 19.74286  7
3   8 14.40 15.2 16.25 15.10000 14
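As a cross-check on the table above, the same per-group quantiles can be reproduced in base R without dplyr. This is only a minimal sketch; the names by_cyl and q are introduced here for illustration and are not part of the original answer. It relies on quantile()'s default interpolation method, which is also what the summarise() call uses.

```r
# Percentiles to compute (same values as in the question)
p <- c(0.25, 0.5, 0.75)

# Split mpg into one numeric vector per cylinder count
by_cyl <- split(mtcars$mpg, mtcars$cyl)

# One quantile() call per group; sapply() binds the results into a matrix
# with one row per percentile and one column per cyl value
q <- sapply(by_cyl, quantile, probs = p)

# Transpose so each row is a cyl group, matching the layout of the table
print(t(q))
```

The values should agree with the 25%, 50% and 75% columns of the summarise() output.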
It would be better if summarise could return multiple values with a single call to quantile, but this appears to be an open issue in dplyr development (hadley/dplyr issue 154).
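One possible way to get by with a single quantile() call per group is to store the result in a list-column and then widen it. The sketch below is not from the original answer: it assumes reasonably recent releases of dplyr and tidyr (tidyr 1.0.0 or later for unnest_wider()), and the names q and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Percentiles to compute (same values as in the question)
p <- c(0.25, 0.5, 0.75)

result <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # Wrap the named quantile vector in a list so that each group
    # still contributes exactly one row
    q   = list(quantile(mpg, probs = p)),
    avg = mean(mpg),
    n   = n()
  ) %>%
  # Spread the named vector into one column per percentile;
  # the column names come from the names quantile() assigns
  unnest_wider(q)

print(result)
```

The trade-off is an extra tidyr step in exchange for not repeating quantile() once per percentile, which may matter more as the list of percentiles grows.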