Categorize a continuous variable with dplyr
Question
I want to create a new variable with 3 arbitrary categories based on continuous data.
set.seed(123)
df <- data.frame(a = rnorm(100))
Using base R I would do:
df$category[df$a < 0.5] <- "low"
df$category[df$a > 0.5 & df$a < 0.6] <- "middle"
df$category[df$a > 0.6] <- "high"
Is there a dplyr (I guess mutate()) solution for this?
Furthermore, is there a way to calculate the categories rather than choosing them? I.e. let R calculate where the breaks for the categories should be.
EDIT
The answer is in this thread; however, it does not involve labelling, which confused me (and may confuse others), so I think this question serves a purpose.
Recommended answer
To convert from numeric to categorical, use cut. In your particular case, you want:
df$category <- cut(df$a,
breaks=c(-Inf, 0.5, 0.6, Inf),
labels=c("low","middle","high"))
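One detail worth checking is what happens to values that fall exactly on a break. By default cut builds right-closed intervals, so 0.5 lands in "low" and 0.6 in "middle", whereas the strict comparisons in the question would leave both as NA. A minimal sketch, using a few made-up values (x is hypothetical and not part of the original data):

```r
# Hypothetical values sitting on and around the two breaks
x <- c(0.49, 0.5, 0.55, 0.6, 0.61)

# Default right = TRUE: intervals are (-Inf, 0.5], (0.5, 0.6], (0.6, Inf]
closed_right <- cut(x,
                    breaks = c(-Inf, 0.5, 0.6, Inf),
                    labels = c("low", "middle", "high"))

# right = FALSE: intervals are [-Inf, 0.5), [0.5, 0.6), [0.6, Inf)
closed_left <- cut(x,
                   breaks = c(-Inf, 0.5, 0.6, Inf),
                   labels = c("low", "middle", "high"),
                   right = FALSE)

# Side-by-side comparison of the two conventions
data.frame(x, closed_right, closed_left)
```

Which convention is appropriate depends on how the boundary values should be treated; the rest of this answer keeps the default.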
Or, using dplyr:
library(dplyr)
res <- df %>%
  mutate(category = cut(a,
                        breaks = c(-Inf, 0.5, 0.6, Inf),
                        labels = c("low", "middle", "high")))
## a category
##1 -0.560475647 low
##2 -0.230177489 low
##3 1.558708314 high
##4 0.070508391 low
##5 0.129287735 low
## ...
##35 0.821581082 high
##36 0.688640254 high
##37 0.553917654 middle
##38 -0.061911711 low
##39 -0.305962664 low
##40 -0.380471001 low
## ...
##96 -0.600259587 low
##97 2.187332993 high
##98 1.532610626 high
##99 -0.235700359 low
##100 -1.026420900 low
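The question also asks whether R can work out the breaks itself. A minimal sketch follows, assuming "calculated" means either equal-width bins or equal-count bins; the column names equal_width, tercile and bucket are illustrative and not from the original post. Passing a single number as breaks makes cut split the range of the data into that many equal-length intervals, quantile supplies data-driven break points, and dplyr's ntile returns a bucket number directly.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(100))

res <- df %>%
  mutate(
    # Equal-width bins: the range of `a` is split into 3 intervals of equal length
    equal_width = cut(a, breaks = 3, labels = c("low", "middle", "high")),
    # Equal-count bins: breaks at the terciles of `a`;
    # include.lowest = TRUE keeps the minimum value from becoming NA
    tercile = cut(a,
                  breaks = quantile(a, probs = c(0, 1/3, 2/3, 1)),
                  labels = c("low", "middle", "high"),
                  include.lowest = TRUE),
    # ntile() gives an integer bucket (1, 2 or 3) rather than a labelled factor
    bucket = ntile(a, 3)
  )

# Number of rows per category under each scheme
table(res$equal_width)
table(res$tercile)
```

With roughly normal data the equal-width bins tend to be unbalanced (most values fall in the middle bin), while the quantile-based bins hold about a third of the rows each, so the choice depends on what the categories are meant to represent.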