如何通过对数据框中的列进行排序来快速形成组(四分位数、十分位数等) [英] How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame

查看:32
本文介绍了如何通过对数据框中的列进行排序来快速形成组(四分位数、十分位数等)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我看到很多关于ordersort 的问题和答案.是否有任何东西可以将向量或数据框分类为分组(如四分位数或十分位数)?我有一个手动"解决方案,但可能有一个经过小组测试的更好的解决方案.

I see a lot of questions and answers re order and sort. Is there anything that sorts vectors or data frames into groupings (like quartiles or deciles)? I have a "manual" solution, but there's likely a better solution that has been group-tested.

这是我的尝试:

temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
temp
#    name       value quartile
# 1     a  2.55118169       NA
# 2     b  0.79755259       NA
# 3     c  0.16918905       NA
# 4     d  1.73359245       NA
# 5     e  0.41027113       NA
# 6     f  0.73012966       NA
# 7     g -1.35901658       NA
# 8     h -0.80591167       NA
# 9     i  0.48966739       NA
# 10    j  0.88856758       NA
# 11    k  0.05146856       NA
# 12    l -0.12310229       NA
temp.sorted <- temp[order(temp$value), ]
temp.sorted$quartile <- rep(1:4, each=12/4)
temp <- temp.sorted[order(as.numeric(rownames(temp.sorted))), ]
temp
#    name       value quartile
# 1     a  2.55118169        4
# 2     b  0.79755259        3
# 3     c  0.16918905        2
# 4     d  1.73359245        4
# 5     e  0.41027113        2
# 6     f  0.73012966        3
# 7     g -1.35901658        1
# 8     h -0.80591167        1
# 9     i  0.48966739        3
# 10    j  0.88856758        4
# 11    k  0.05146856        2
# 12    l -0.12310229        1

有更好(更干净/更快/一行)的方法吗?谢谢!

Is there a better (cleaner/faster/one-line) approach? Thanks!

推荐答案

我使用的方法是其中之一或者 Hmisc::cut2(value, g=4):

The method I use is one of these or Hmisc::cut2(value, g=4):

temp$quartile <- with(temp, cut(value, 
                                breaks=quantile(value, probs=seq(0,1, by=0.25), na.rm=TRUE), 
                                include.lowest=TRUE))

另一种可能是:

temp$quartile <- with(temp, factor(
                            findInterval( val, c(-Inf,
                               quantile(val, probs=c(0.25, .5, .75)), Inf) , na.rm=TRUE), 
                            labels=c("Q1","Q2","Q3","Q4")
      ))

第一个具有用值标记四分位数的副作用,我认为这是好事",但如果它不是对你有好处",或者评论中提出的有效问题是一个问题您可以使用第 2 版.您可以在 cut 中使用 labels=,或者您可以将此行添加到您的代码中:

The first one has the side-effect of labeling the quartiles with the values, which I consider a "good thing", but if it were not "good for you", or the valid problems raised in the comments were a concern you could go with version 2. You can use labels= in cut, or you could add this line to your code:

temp$quartile <- factor(temp$quartile, levels=c("1","2","3","4") )

或者甚至更快,但它的工作原理稍微有点模糊,尽管它不再是一个因素,而是一个数字向量:

Or even quicker but slightly more obscure in how it works, although it is no longer a factor, but rather a numeric vector:

temp$quartile <- as.numeric(temp$quartile)

这篇关于如何通过对数据框中的列进行排序来快速形成组(四分位数、十分位数等)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆