Using dplyr window functions to calculate percentiles

Problem description

I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.

Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:

library(dplyr)
library(tidyr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

# old dplyr solution 
mtcars %>% group_by(cyl) %>% 
  do(data.frame(p=p, stats=quantile(.$mpg, probs=p), 
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)

# note: the select and spread statements are just to get the data into
#       the format in which I'd like to see it, but are not critical

Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By cleanly, I mean without the "do" statement.

Thank you

Solution

Here's a dplyr approach that avoids do but requires a separate call to quantile for each quantile value.

mtcars %>% group_by(cyl) %>%
  summarise(`25%`=quantile(mpg, probs=0.25),
            `50%`=quantile(mpg, probs=0.5),
            `75%`=quantile(mpg, probs=0.75),
            avg=mean(mpg),
            n=n())

  cyl   25%  50%   75%      avg  n
1   4 22.80 26.0 30.40 26.66364 11
2   6 18.65 19.7 21.00 19.74286  7
3   8 14.40 15.2 16.25 15.10000 14

It would be better if summarise could return multiple values from a single call to quantile, but this appears to be an open issue in dplyr development (hadley/dplyr, issue 154).
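For readers on a newer dplyr, here is a minimal sketch of a single-call variant. It assumes dplyr 1.0.0 or later, where summarise unpacks an unnamed data-frame result into one column per element; the poster's dplyr did not do this. The names p and res are just illustrative.

```r
library(dplyr)

# Percentiles to compute (same values as in the question)
p <- c(0.25, 0.5, 0.75)

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # One call to quantile(); the named vector becomes a one-row tibble,
    # which summarise() spreads into the columns `25%`, `50%`, `75%`.
    # Assumes dplyr >= 1.0.0 (unnamed data-frame results are unpacked).
    as_tibble(as.list(quantile(mpg, probs = p))),
    avg = mean(mpg),
    n = n()
  )

print(res)
```

The result should have the same columns as the table above (cyl, the three percentile columns, avg, n), without repeating the quantile call for each probability. Changing p is then enough to add or drop percentiles.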
