How to do range grouping on a column using dplyr?
Problem description
I want to group a data.table based on ranges of a column's values. How can I do this with the dplyr library?

For example, my data table looks like this:

    library(data.table)
    library(dplyr)
    DT <- data.table(A = 1:100, B = runif(100), Amount = runif(100, 0, 100))

Now I want to split DT into 20 groups using 0.05-wide intervals of column B, and count how many rows are in each group. For example, any rows with a column B value in the range [0, 0.05) will form one group; any rows with a column B value in the range [0.05, 0.1) will form another group, and so on. Is there an efficient way of doing this grouping?
Thank you very much.
-----------------------------
More questions on akrun's answer.

Thanks, akrun, for your answer. I have a new question about the cut function. If my DT looks like this:

    DT <- data.table(A = 1:10, B = c(0.01, 0.04, 0.06, 0.09, 0.1, 0.13, 0.14, 0.15, 0.17, 0.71))

and I use the following code:

    DT %>%
      group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05), right = FALSE)) %>%
      summarise(n = n()) %>%
      arrange(as.numeric(gr))
I expect to see results like this:

              gr n
    1   [0,0.05) 2
    2 [0.05,0.1) 2
    3 [0.1,0.15) 3
    4 [0.15,0.2) 2
    5 [0.7,0.75) 1

but the result I got is like this:

              gr n
    1   [0,0.05) 2
    2 [0.05,0.1) 2
    3 [0.1,0.15) 4
    4 [0.15,0.2) 1
    5 [0.7,0.75) 1
Looks like the value 0.15 is not correctly allocated. Any thoughts on this?
Solution

We can use cut() to do the grouping. We create the 'gr' column within group_by(), use summarise() to count the number of elements in each group (n()), and order the output with arrange() based on 'gr'.

    library(dplyr)
    DT %>%
      group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>%
      summarise(n = n()) %>%
      arrange(as.numeric(gr))
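
As a minimal sketch of what the left-closed bins asked for in the question look like, the example below runs cut() with right = FALSE on a small made-up data frame. The names toy and binned and the data values are hypothetical, and the use of count() with .drop = FALSE (to keep empty bins) is an addition for illustration rather than part of the original answer.

```r
library(dplyr)

# Hypothetical toy data; values are chosen to sit away from the bin edges
toy <- data.frame(B = c(0.01, 0.04, 0.06, 0.09, 0.12, 0.71))

binned <- toy %>%
  # right = FALSE produces left-closed bins such as [0,0.05)
  mutate(gr = cut(B, breaks = seq(0, 1, by = 0.05), right = FALSE)) %>%
  # .drop = FALSE keeps the factor levels (bins) that contain no rows
  count(gr, .drop = FALSE)

print(binned)
```

With .drop = FALSE the result has one row per bin, including bins with n = 0, which can be convenient when all 20 groups need to appear in the output.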
As the initial object is a data.table, this can also be done using data.table methods (including @Frank's suggestion to use keyby):

    library(data.table)
    DT[, .N, keyby = .(gr = cut(B, breaks = seq(0, 1, by = 0.05)))]
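
To illustrate why keyby is suggested here, the sketch below compares by and keyby on a small made-up table. The names toy, by_first_seen and by_sorted and the data values are hypothetical. As far as the data.table documentation describes it, by returns groups in order of first appearance, while keyby also sorts the result by the grouping column.

```r
library(data.table)

# Hypothetical toy data, deliberately not sorted by B
toy <- data.table(B = c(0.71, 0.01, 0.12, 0.04))

# by: groups come back in the order they first appear in the data
by_first_seen <- toy[, .N, by = .(gr = cut(B, breaks = seq(0, 1, by = 0.05)))]

# keyby: same counts, but the result is sorted by the grouping column
by_sorted <- toy[, .N, keyby = .(gr = cut(B, breaks = seq(0, 1, by = 0.05)))]

print(by_first_seen)
print(by_sorted)
```

Because gr is a factor, the sort follows the order of the factor levels, so the bins come out in increasing numeric order without a separate ordering step.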
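EDIT:

Based on the update in the OP's post, we could subtract a small number from the seq() breaks, so that a value lying on a bin boundary, such as 0.15, falls into the upper bin. Note that lvls is taken from a default (right = TRUE) call to cut(), so the labels print as (a,b] even though the bins are left-closed.

    lvls <- levels(cut(DT$B, seq(0, 1, by = 0.05)))
    DT %>%
      group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05) - .Machine$double.eps,
                        right = FALSE, labels = lvls)) %>%
      summarise(n = n()) %>%
      arrange(as.numeric(gr))
    #           gr n
    # 1   (0,0.05] 2
    # 2 (0.05,0.1] 2
    # 3 (0.1,0.15] 3
    # 4 (0.15,0.2] 2
    # 5 (0.7,0.75] 1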
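
The misallocation of 0.15 appears to come from floating-point rounding in the breaks: seq() builds the fourth break as 3 * 0.05, which on typical IEEE 754 double arithmetic is slightly larger than the literal 0.15. The base R sketch below checks this. The variable names are hypothetical, and the integer-based breaks at the end are an alternative added for illustration, not something proposed in the original answer.

```r
breaks <- seq(0, 1, by = 0.05)

# The fourth break is computed as 3 * 0.05, which is not the same double as 0.15
fourth_is_larger <- breaks[4] > 0.15

# With right = FALSE, 0.15 therefore lands just below the [0.15, 0.2) edge
bin_seq <- as.character(cut(0.15, breaks, right = FALSE))

# Alternative (an assumption, not from the answer): derive breaks by dividing
# integers, so each break is the double closest to the intended decimal value
breaks_int <- (0:20) / 20
bin_int <- as.character(cut(0.15, breaks_int, right = FALSE))

print(fourth_is_larger)
print(bin_seq)
print(bin_int)
```

Either shifting the breaks down by a tiny amount, as in the answer, or constructing them so they match the literal values avoids the surprise. Which is preferable likely depends on whether the data values themselves are exact decimals or computed results.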