R: creating a categorical variable from a numerical variable and custom/open-ended/single-valued intervals
I often find myself trying to create a categorical variable from a numerical variable + a user-provided set of ranges.
For instance, say that I have a data.frame with a numeric variable df$V and would like to create a new variable df$VCAT such that:
- df$VCAT = 0 if df$V is equal to 0
- df$VCAT = 1 if df$V is between 0 and 10 (i.e. (0,10))
- df$VCAT = 2 if df$V is equal to 10 (i.e. [10,10])
- df$VCAT = 3 if df$V is between 10 and 20 (i.e. (10,20))
- df$VCAT = 4 if df$V is greater than or equal to 20 (i.e. [20,Inf))
I am currently doing this by hard-coding the "scoring function" myself, with something like:
library(dplyr)
df = data.frame(V = seq(1,100))
df = df %>% mutate(VCAT = (V>0) + (V==10) + 2*(V>10) + (V>=20))
I am wondering if there is an easier hacky way to do this in R, preferably using dplyr (so that I can chain commands). Ideally, I am looking for a short function that can be used in mutate and that takes the variable V and a vector describing the ranges, such as buckets.
Note that buckets may not be described in the best way here, since it is not clear to me how it would let users customize the endpoints of the ranges.
One way I bin numbers is to remove the remainder using the modulus operator, %%. E.g. to bin into groups of 20:
library(dplyr)
#create raw data
unbinned <- c(1.1,1.53,5,8.3,33.5,49.22,55,57.9,79.6,81,95,201,213)
rawdata <- as.data.frame(unbinned)
#bin the data into groups of 20 (subtracting the remainder leaves the bin's lower bound)
binneddata <- mutate(rawdata, binned = unbinned - unbinned %% 20)
#print the data
binneddata
This produces the output:
unbinned binned
1 1.10 0
2 1.53 0
3 5.00 0
4 8.30 0
5 33.50 20
6 49.22 40
7 55.00 40
8 57.90 40
9 79.60 60
10 81.00 80
11 95.00 80
12 201.00 200
13 213.00 200
So 0 represents 0-<20, 20 represents 20-<40, 40 represents 40-<60, etc. (Of course, divide the binned value by 20 to get sequential group numbers like those in the original question.)
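The "divide by 20" step can be sketched as follows. This is a minimal illustration in base R, not part of the original answer; the names bin_width, group, lower and dense are assumptions introduced here. It uses integer division (%/%) to get the group number directly, and shows one way to get consecutive ids when some bins are empty.

```r
# Same sample data as in the answer.
unbinned <- c(1.1, 1.53, 5, 8.3, 33.5, 49.22, 55, 57.9, 79.6, 81, 95, 201, 213)
bin_width <- 20                       # assumed bin width, as in the answer

# Integer division gives the sequential group number of each value.
group <- unbinned %/% bin_width

# Multiplying back recovers the lower bound of each bin.
lower <- group * bin_width

# Group numbers have gaps when bins are empty (4 jumps to 10 here);
# a factor gives consecutive ids 1, 2, 3, ... over the bins that occur.
dense <- as.integer(factor(group))

data.frame(unbinned, group, lower, dense)
```

Inside a dplyr chain the same expression could be used as mutate(rawdata, group = unbinned %/% 20).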
Bonus
If you want to use the binned values as categorical variables in ggplot etc. by converting them into strings, they will order strangely: e.g. 200 will come before 40, because '2' sorts before '4'. To get around this, use the sprintf function to create leading zeros (the 3 in %03d should be the number of digits in the longest number you expect):
#convert the data into strings with leading zeros
binnedstring<-mutate(binneddata,bin_as_character=sprintf('%03d',binned))
#print the data
binnedstring
giving the output:
unbinned binned bin_as_character
1 1.10 0 000
2 1.53 0 000
3 5.00 0 000
4 8.30 0 000
5 33.50 20 020
etc.
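To make the ordering problem concrete, here is a small base R sketch (not from the original answer) comparing the sort order of plain and zero-padded strings. It also derives the padding width from the data rather than hard-coding 3; the names bins, width, fmt and padded are illustrative assumptions.

```r
# A few bin lower bounds, including one with three digits.
bins <- c(0, 20, 40, 200)

# Plain character conversion sorts as text, so "200" lands before "40".
sort(as.character(bins))

# Derive the padding width from the largest value instead of hard-coding it.
width <- nchar(as.character(as.integer(max(bins))))
fmt <- paste0("%0", width, "d")       # builds "%03d" for this data

# Zero-padded strings sort in the same order as the numbers.
padded <- sprintf(fmt, as.integer(bins))
sort(padded)
```

Deriving the width this way assumes non-negative, whole-valued bin bounds, which holds for the data used here.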
If you want labels such as 000-<020, create the upper bound using arithmetic and concatenate using the paste function:
#make human readable bin value
binnedstringband<-mutate(
binnedstring,
nextband=binned+20,
human_readable=paste(bin_as_character,'-<',sprintf('%03d',nextband),sep='')
)
#print the data
binnedstringband
Giving:
unbinned binned bin_as_character nextband human_readable
1 1.10 0 000 20 000-<020
2 1.53 0 000 20 000-<020
3 5.00 0 000 20 000-<020
4 8.30 0 000 20 000-<020
5 33.50 20 020 40 020-<040
etc.
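The modulus approach assumes equal-width bins, so on its own it does not cover the mix of open, closed and single-valued intervals in the question. One possible sketch, generalising the question's own sum-of-comparisons trick, is below. The helper bucket_score and its arguments open_at and closed_at are hypothetical names introduced here, not an existing dplyr or base R function, and the sketch assumes V is non-negative.

```r
# Hypothetical helper (not from the original post): for each value, count how
# many thresholds it has passed.
#   open_at:   thresholds t that count once v >  t
#   closed_at: thresholds t that count once v >= t
# A threshold listed in both vectors gets its own single-valued category.
bucket_score <- function(v, open_at, closed_at) {
  passed_open   <- rowSums(outer(v, open_at, ">"))
  passed_closed <- rowSums(outer(v, closed_at, ">="))
  passed_open + passed_closed
}

# Intervals from the question: {0}, (0,10), {10}, (10,20), [20,Inf)
df <- data.frame(V = c(0, 5, 10, 15, 20, 100))
df$VCAT <- bucket_score(df$V, open_at = c(0, 10), closed_at = c(10, 20))
df
```

Because the helper is vectorised, it should also work inside a chain, e.g. df %>% mutate(VCAT = bucket_score(V, c(0, 10), c(10, 20))). For intervals without single-valued categories, base R's cut or findInterval may be simpler.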