R: creating a categorical variable from a numerical variable and custom/open-ended/single-valued intervals


Problem description

I often find myself trying to create a categorical variable from a numerical variable + a user-provided set of ranges.

For instance, say that I have a data.frame with a numeric variable df$V and would like to create a new variable df$VCAT such that:

  • df$VCAT = 0 if df$V is equal to 0
  • df$VCAT = 1 if df$V is strictly between 0 and 10 (i.e. (0,10))
  • df$VCAT = 2 if df$V is equal to 10 (i.e. [10,10])
  • df$VCAT = 3 if df$V is strictly between 10 and 20 (i.e. (10,20))
  • df$VCAT = 4 if df$V is greater than or equal to 20 (i.e. [20,Inf))

I currently do this by hard-coding the "scoring function" myself, with something like:

library(dplyr)
df = data.frame(V = seq(1,100))
#each comparison adds to the score as V crosses a boundary
df = df %>% mutate(VCAT = (V>0) + (V==10) + 2*(V>10) + (V>=20))

I am wondering whether there is an easier way (even a hacky one) to do this in R, preferably using dplyr so that I can chain commands. Ideally, I am looking for a short function that can be used in mutate and that takes the variable V and a vector describing the ranges, such as buckets. Note that buckets may not be described in the best way here, since it is not clear to me how it would let users customize the endpoints of the ranges.

Solution

One way I bin numbers is to subtract the remainder, which the modulus operator %% computes. E.g. to bin into groups of 20:

#load dplyr for mutate
library(dplyr)

#create raw data
unbinned <- c(1.1, 1.53, 5, 8.3, 33.5, 49.22, 55, 57.9, 79.6, 81, 95, 201, 213)
rawdata <- as.data.frame(unbinned)

#bin the data into groups of 20 by subtracting the remainder
binneddata <- mutate(rawdata, binned = unbinned - unbinned %% 20)

#print the data
binneddata

This produces the output:

   unbinned binned
1      1.10      0
2      1.53      0
3      5.00      0
4      8.30      0
5     33.50     20
6     49.22     40
7     55.00     40
8     57.90     40
9     79.60     60
10    81.00     80
11    95.00     80
12   201.00    200
13   213.00    200

So 0 represents 0-<20, 20 represents 20-<40, 40 represents 40-<60, etc. (Of course, you can divide the binned value by 20 to get sequential groups like those in the original question.)
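The parenthetical remark above can be sketched as follows. This is only a minimal illustration, assuming the same sample data as before; the names bin_width, group, and grouped are made up for this sketch and are not part of the original answer.

```r
library(dplyr)

unbinned <- c(1.1, 1.53, 5, 8.3, 33.5, 49.22, 55, 57.9, 79.6, 81, 95, 201, 213)
rawdata <- data.frame(unbinned)

# bin_width, group and grouped are hypothetical names used only in this sketch
bin_width <- 20

grouped <- mutate(
  rawdata,
  binned = unbinned - unbinned %% bin_width,  # lower edge of each bin
  group  = binned / bin_width                 # 0 for 0-<20, 1 for 20-<40, ...
)

grouped
```

Note that this yields equal-width groups only; it does not by itself reproduce the single-valued intervals ([10,10]) asked about in the question.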

Bonus

If you convert the binned values to strings so that you can use them as categorical variables in ggplot etc., they will sort strangely: 200 will come before 40, because the character '2' sorts before '4'. To get around this, use the sprintf function to add leading zeros, where the 3 in %03d should be the number of digits in the longest number you expect:

#convert the data into strings with leading zeros
binnedstring<-mutate(binneddata,bin_as_character=sprintf('%03d',binned))

#print the data
binnedstring

This gives the output:

   unbinned binned bin_as_character
1      1.10      0              000
2      1.53      0              000
3      5.00      0              000
4      8.30      0              000
5     33.50     20              020
etc.
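The ordering problem that motivates the leading zeros can be checked directly. The snippet below is a small sketch with made-up bin values; method = "radix" is used only so that the comparison does not depend on the local collation settings.

```r
# a few example bin values (chosen for illustration)
bins <- c(0, 20, 40, 200)

# plain string conversion: sorted character by character, so "200" lands before "40"
plain_order <- sort(as.character(bins), method = "radix")

# zero-padded strings: character order now matches numeric order
padded_order <- sort(sprintf('%03d', as.integer(bins)), method = "radix")

plain_order
padded_order
```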

If you want labels of the form 000-<020, create the upper bound using arithmetic and concatenate the pieces with the paste function:

#make human readable bin value
binnedstringband<-mutate(
    binnedstring,
    nextband=binned+20,
    human_readable=paste(bin_as_character,'-<',sprintf('%03d',nextband),sep='')
)

#print the data
binnedstringband

Giving:

   unbinned binned bin_as_character nextband     human_readable
1      1.10      0              000       20           000-<020
2      1.53      0              000       20           000-<020
3      5.00      0              000       20           000-<020
4      8.30      0              000       20           000-<020
5     33.50     20              020       40           020-<040
etc.
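The modulus approach above produces equal-width bins, so it does not cover the single-valued and open-ended intervals from the question. As a separate sketch, and not part of the original answer, those intervals could be written out explicitly with dplyr's case_when, assuming the category codes 0 to 4 from the question; the sample values of V are made up.

```r
library(dplyr)

# sample values chosen to hit every interval from the question
df <- data.frame(V = c(0, 5, 10, 15, 20, 100))

df <- df %>%
  mutate(VCAT = case_when(
    V == 0          ~ 0,  # [0,0]
    V > 0 & V < 10  ~ 1,  # (0,10)
    V == 10         ~ 2,  # [10,10]
    V > 10 & V < 20 ~ 3,  # (10,20)
    V >= 20         ~ 4   # [20,Inf)
  ))

df
```

Values that match none of the conditions (negative numbers here) come out as NA, which makes gaps in the interval definitions visible.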
