Categorize continuous variable with dplyr


Question

I want to create a new variable with 3 arbitrary categories based on continuous data.

set.seed(123)
df <- data.frame(a = rnorm(100))

Using base R, I would do:

df$category[df$a < 0.5] <- "low"
df$category[df$a > 0.5 & df$a < 0.6] <- "middle"
df$category[df$a > 0.6] <- "high"

Is there a dplyr solution for this, presumably using mutate()?

Furthermore, is there a way to calculate the categories rather than choosing them? That is, can R work out where the breaks between the categories should be?

EDIT

The answer is in this thread; however, it does not involve labelling, which confused me (and may confuse others), so I think this question still serves a purpose.

Recommended answer

To convert from numeric to categorical, use cut(). In your particular case, you want:

df$category <- cut(df$a, 
                   breaks=c(-Inf, 0.5, 0.6, Inf), 
                   labels=c("low","middle","high"))
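One detail worth checking is what happens exactly at the break points. By default cut() uses right-closed intervals (right = TRUE), so a value of exactly 0.5 is labelled "low" and exactly 0.6 is labelled "middle", whereas the base R code in the question would leave both as NA. The sketch below uses a small made-up vector x (hypothetical values, not from the question) to show both settings:

```r
# Hypothetical values sitting on and around the break points (not from the question)
x <- c(0.4, 0.5, 0.55, 0.6, 0.7)
brks <- c(-Inf, 0.5, 0.6, Inf)
labs <- c("low", "middle", "high")

# Default (right = TRUE): intervals are (-Inf, 0.5], (0.5, 0.6], (0.6, Inf]
right_closed <- cut(x, breaks = brks, labels = labs)
as.character(right_closed)
# "low" "low" "middle" "middle" "high"

# right = FALSE: intervals are [-Inf, 0.5), [0.5, 0.6), [0.6, Inf)
left_closed <- cut(x, breaks = brks, labels = labs, right = FALSE)
as.character(left_closed)
# "low" "middle" "middle" "high" "high"
```

Which setting is appropriate depends on how you want boundary values treated; with continuous data such as rnorm() output, exact ties with a break are unlikely but not impossible.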

Or, using dplyr:

library(dplyr)
res <- df %>%
  mutate(category = cut(a,
                        breaks = c(-Inf, 0.5, 0.6, Inf),
                        labels = c("low", "middle", "high")))
res
##               a category
##1   -0.560475647      low
##2   -0.230177489      low
##3    1.558708314     high
##4    0.070508391      low
##5    0.129287735      low
## ...
##35   0.821581082     high
##36   0.688640254     high
##37   0.553917654   middle
##38  -0.061911711      low
##39  -0.305962664      low
##40  -0.380471001      low
## ...
##96  -0.600259587      low
##97   2.187332993     high
##98   1.532610626     high
##99  -0.235700359      low
##100 -1.026420900      low
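The second part of the question, letting R work out the breaks, is not covered above. A minimal sketch of two common options follows, assuming three categories are wanted: passing a single number to breaks makes cut() split the range of the data into that many equal-width intervals, while passing quantiles as breaks gives groups of roughly equal size. The column names equal_width and equal_count are made up for illustration.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(100))

res <- df %>%
  mutate(
    # Equal-width bins: cut() divides the range of `a` into 3 intervals
    equal_width = cut(a, breaks = 3, labels = c("low", "middle", "high")),
    # Roughly equal-count bins: break at the tertiles of `a`;
    # include.lowest = TRUE keeps the minimum value in the first bin
    equal_count = cut(a,
                      breaks = quantile(a, probs = c(0, 1/3, 2/3, 1)),
                      labels = c("low", "middle", "high"),
                      include.lowest = TRUE)
  )

# Compare how many observations fall into each category
table(res$equal_width)
table(res$equal_count)
```

Equal-width bins can be very unevenly populated when the data are skewed, so quantile-based breaks are often the safer default. dplyr also provides ntile(), which returns an integer bucket number rather than a labelled factor. Note that quantile-based breaks must be unique, which can fail for data with many tied values.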
