dplyr - 按组大小过滤 [英] dplyr - filter by group size
本文介绍了dplyr - 按组大小过滤的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
过滤 data.frame 以仅获取大小为 5 的组的最佳方法是什么?
What is the best way to filter a data.frame to only get groups of say size 5?
所以我的数据如下所示:
So my data looks as follows:
require(dplyr)
n <- 1e5
x <- rnorm(n)
# Category size ranging each from 1 to 5
cat <- rep(seq_len(n/3), sample(1:5, n/3, replace = TRUE))[1:n]
dat <- data.frame(x = x, cat = cat)
我能想到的 dplyr 方法是
The dplyr way i could come up with was
dat <- group_by(dat, cat)
system.time({
out1 <- dat %>% filter(n() == 5L)
})
# user system elapsed
# 1.157 0.218 1.497
但这很慢... dplyr 有没有更好的方法?
But this is very slow... Is there a better way in dplyr?
到目前为止,我的解决方法如下:
So far my workaround solutions looks as follows:
system.time({
all_ind <- rep(seq_len(n_groups(dat)), group_size(dat))
take_only <- which(group_size(dat) == 5L)
out2 <- dat[all_ind %in% take_only, ]
})
# user system elapsed
# 0.026 0.008 0.036
all.equal(out1, out2) # TRUE
但是这感觉不太像...
But this doesn't feel very dplyr like...
推荐答案
这是另一种你可以尝试的 dplyr 方法
Here's another dplyr approach you can try
semi_join(dat, count(dat, cat) %>% filter(n == 5), by = "cat")
--
这是另一种基于 OP 原始方法并稍作修改的方法:
Here's another approach based on OP's original approach with a little modification:
n <- 1e5
x <- rnorm(n)
# Category size ranging each from 1 to 5
cat <- rep(seq_len(n/3), sample(1:5, n/3, replace = TRUE))[1:n]
dat <- data.frame(x = x, cat = cat)
# second data set for the dt approch
dat2 <- data.frame(x = x, cat = cat)
sol_floo0 <- function(dat){
dat <- group_by(dat, cat)
all_ind <- rep(seq_len(n_groups(dat)), group_size(dat))
take_only <- which(group_size(dat) == 5L)
dat[all_ind %in% take_only, ]
}
sol_floo0_v2 <- function(dat){
g <- group_by(dat, cat) %>% group_size()
ind <- rep(g == 5, g)
dat[ind, ]
}
microbenchmark::microbenchmark(times = 10,
sol_floo0(dat),
sol_floo0_v2(dat2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# sol_floo0(dat) 43.72903 44.89957 45.71121 45.10773 46.59019 48.64595 10 b
# sol_floo0_v2(dat2) 29.83724 30.56719 32.92777 31.97169 34.10451 38.31037 10 a
all.equal(sol_floo0(dat), sol_floo0_v2(dat2))
#[1] TRUE
这篇关于dplyr - 按组大小过滤的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文