How to categorize a continuous variable into 4 groups of the same size in R?
Question
I need to categorize a continuous variable into 4 classes, each with the same number of observations. I have used the function
cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)
My problem is that the number of observations in each category is not exactly the same, because several observations have exactly the same value as one of the quantiles. How can I get groups of equal size?
My variable is waiting:
[1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74
[26] 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59
[51] 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71 62
[76] 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50 78 63 72 84 75 51 82
[101] 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56 88
[126] 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53
[151] 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81
[176] 81 73 50 85 74 55 77 83 83 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78
[201] 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78
[226] 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74
[251] 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74
which is in the dataset faithful in R. It has 272 observations, so its length is divisible by 4, giving 68 observations in each category.
I have used
newwait<-cut(waiting, breaks =quantile(waiting,probs=seq(0,1,0.25)),include.lowest=TRUE,right=FALSE)
table(newwait)
newwait
[43,58) [58,76) [76,82) [82,96]
66 68 67 71
As you can see, the number of observations in each group is similar but not exactly the same.
Recommended answer
Basically, it sounds like you need to deal with ties. You also need a vector whose length, when divided by 4, yields an integer, but I'll assume you know that.
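Both conditions can be checked directly before choosing a fix. The sketch below assumes that the question's waiting vector is faithful$waiting from R's built-in datasets; the names q and tie_counts are introduced here for illustration only.

```r
# Assumption: `waiting` is faithful$waiting from R's built-in datasets.
waiting <- faithful$waiting

# Four equal groups are only possible if the length is a multiple of 4.
length(waiting) %% 4 == 0

# Quartile break points, computed the same way as in the question.
q <- quantile(waiting, probs = seq(0, 1, 0.25))

# How many observations are exactly equal to each break point.
# A count above 1 at an interior break is a tie that cut() must place
# entirely in one interval, which is what unbalances the groups.
tie_counts <- sapply(q, function(v) sum(waiting == v))
tie_counts
```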
Here's a solution using the tie-breaking options of rank:
set.seed(1)
x <- round(runif(1000,0,1),1)
table(x)
## x
## 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
## 43 106 95 103 112 109 82 102 95 100 53
y <- rank(x, ties.method='first') # <- this forces tie breaks
cuts <- cut(y, breaks = quantile(y,probs=seq(0,1,0.25)),
include.lowest=TRUE,
right=FALSE)
# check that cuts are all the same length:
lapply(split(x,cuts), length)
## $`[1,251)`
## [1] 250
## $`[251,500)`
## [1] 250
## $`[500,750)`
## [1] 250
## $`[750,1e+03]`
## [1] 250
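The same idea can be applied to the data from the question. This is a minimal sketch, assuming that waiting is faithful$waiting; the variable r and the labels Q1 to Q4 are hypothetical names chosen for this example.

```r
# Assumption: `waiting` is faithful$waiting from R's built-in datasets.
waiting <- faithful$waiting

# Rank the values, breaking ties by order of appearance, so that every
# observation receives a unique rank from 1 to 272.
r <- rank(waiting, ties.method = "first")

# Cut the ranks (not the raw values) at their quartiles. Because the ranks
# are the distinct integers 1..272, each interval contains 68 of them.
newwait <- cut(r,
               breaks = quantile(r, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE,
               right = FALSE,
               labels = paste0("Q", 1:4))  # hypothetical group labels

# Each of the four groups should now have 68 observations.
table(newwait)
```

One consequence of this design is that observations with identical waiting values can end up in different groups, since ties are split by their position in the vector. If an arbitrary but reproducible split is not acceptable, ties.method = "random" together with set.seed() is another option that rank supports.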