How to do range grouping on a column using dplyr?
Question
I want to group a data.table by ranges of a column's values. How can I do this with the dplyr library?
For example, my data table looks like this:
library(data.table)
library(dplyr)
DT <- data.table(A=1:100, B=runif(100), Amount=runif(100, 0, 100))
Now I want to split DT into 20 groups at 0.05 intervals of column B and count how many rows are in each group. For example, rows with a column B value in the range [0, 0.05) form one group, rows with a column B value in the range [0.05, 0.1) form another group, and so on. Is there an efficient way to do this grouping?
Thanks a lot.
Follow-up on akrun's answer: Thanks, akrun, for your answer. I have a new question about the cut function. Suppose my DT is as follows:
DT <- data.table(A=1:10, B=c(0.01, 0.04, 0.06, 0.09, 0.1, 0.13, 0.14, 0.15, 0.17, 0.71))
Using the following code:
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05), right = FALSE)) %>%
  summarise(n = n()) %>%
  arrange(as.numeric(gr))
I expect to see results like this:
gr n
1 [0,0.05) 2
2 [0.05,0.1) 2
3 [0.1,0.15) 3
4 [0.15,0.2) 2
5 [0.7,0.75) 1
But the result I get is this:
gr n
1 [0,0.05) 2
2 [0.05,0.1) 2
3 [0.1,0.15) 4
4 [0.15,0.2) 1
5 [0.7,0.75) 1
It looks like the value 0.15 is not allocated to the correct group. Any thoughts on this?
Recommended answer
We can use cut to do the grouping. We create the 'gr' column within group_by, use summarise to count the number of elements in each group (n()), and order the output (arrange) based on 'gr'.
library(dplyr)
DT %>%
group_by(gr=cut(B, breaks= seq(0, 1, by = 0.05)) ) %>%
summarise(n= n()) %>%
arrange(as.numeric(gr))
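As a quick sanity check on the pipeline above, the same binning can be reproduced with base R's cut and table. This is only a sketch: the seed and the names breaks, gr and bin_counts are illustrative assumptions, not part of the original answer, and right = FALSE is used so the bins are left-closed as the question asks. Unlike summarise, table also reports bins with zero rows, so all 20 intervals appear.

```r
set.seed(42)                      # assumed seed, only to make the sketch reproducible
B <- runif(100)                   # same shape as column B in the question

breaks <- seq(0, 1, by = 0.05)    # 21 break points -> 20 bins
# right = FALSE gives left-closed bins: [0,0.05), [0.05,0.1), ...
gr <- cut(B, breaks = breaks, right = FALSE)

bin_counts <- table(gr)           # one count per bin, empty bins included
print(bin_counts)

length(bin_counts)                # number of bins
sum(bin_counts)                   # total rows that were assigned to a bin
```

If the counts from the dplyr pipeline (run with the same right setting) do not add up to the same total, some values were probably turned into NA by falling outside the breaks.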
As the initial object is a data.table, this can also be done using data.table methods (including @Frank's suggestion to use keyby):
library(data.table)
DT[,.N , keyby = .(gr=cut(B, breaks=seq(0, 1, by=0.05)))]
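Based on the update in the OP's post: the fourth break from seq(0, 1, by = 0.05) is computed as 3 * 0.05, which in floating point is slightly larger than 0.15, so the value 0.15 falls into [0.1,0.15). We could subtract a small number from the seq breaks: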
# Take the labels from left-closed intervals so they match right = FALSE
lvls <- levels(cut(DT$B, seq(0, 1, by = 0.05), right = FALSE))
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05) - .Machine$double.eps,
                    right = FALSE, labels = lvls)) %>%
  summarise(n = n()) %>%
  arrange(as.numeric(gr))
# gr n
#1 [0,0.05) 2
#2 [0.05,0.1) 2
#3 [0.1,0.15) 3
#4 [0.15,0.2) 2
#5 [0.7,0.75) 1
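To see why 0.15 lands in the wrong bin, the breaks can be compared directly with the literal value. The sketch below also shows one possible alternative that is not from the original answer: building the breaks by integer division, so each break is the nearest double to k/20. The names breaks_seq, breaks_div and counts are assumptions made for illustration, and only base R is used.

```r
# Breaks as in the question: the 4th break is computed as 3 * 0.05
breaks_seq <- seq(0, 1, by = 0.05)
print(breaks_seq[4], digits = 17)   # shows the stored value of the 4th break
breaks_seq[4] == 0.15               # FALSE here, consistent with the OP's result

# Assumed alternative: divide integers instead of multiplying by 0.05
breaks_div <- (0:20) / 20
breaks_div[4] == 0.15               # the break now matches the literal 0.15

# The OP's second data set
B <- c(0.01, 0.04, 0.06, 0.09, 0.1, 0.13, 0.14, 0.15, 0.17, 0.71)
counts <- table(cut(B, breaks = breaks_div, right = FALSE))
counts[counts > 0]                  # keep only the non-empty bins
```

The same breaks_div vector could be passed as breaks in the group_by(cut(...)) call, which avoids shifting the breaks and relabelling the levels. It only helps when the data values are themselves written as short decimals; values that come out of arithmetic may still sit a hair below a boundary.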