dplyr - filter by group size

Published: 2021/12/23 12:16:47. Tags: r, dataframe, filter, dplyr, subset

Question

What is the best way to filter a data.frame so that only groups of a given size, say 5, are kept?

My data looks as follows:

library(dplyr)
n <- 1e5
x <- rnorm(n)
# Category size ranging each from 1 to 5
cat <- rep(seq_len(n/3), sample(1:5, n/3, replace = TRUE))[1:n]

dat <- data.frame(x = x, cat = cat)

The dplyr way I could come up with was:

dat <- group_by(dat, cat)

system.time({
  out1 <- dat %>% filter(n() == 5L)
})
#    user  system elapsed 
#   1.157   0.218   1.497

But this is very slow. Is there a better way in dplyr?

So far, my workaround looks as follows:

system.time({
  all_ind <- rep(seq_len(n_groups(dat)), group_size(dat))
  take_only <- which(group_size(dat) == 5L)
  out2 <- dat[all_ind %in% take_only, ]
})
#    user  system elapsed 
#   0.026   0.008   0.036
all.equal(out1, out2) # TRUE

But this doesn't feel very dplyr-like.

Recommended answer

Here's another dplyr approach you can try:

semi_join(dat, count(dat, cat) %>% filter(n == 5), by = "cat")
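
To see why this one-liner works, here is a minimal sketch on a toy data frame (the `toy` data and its group sizes are made up for illustration and are not part of the original answer). `count()` returns one row per `cat` with the group size in column `n`, `filter(n == 5)` keeps the groups of size 5, and `semi_join()` keeps the rows of the original data whose `cat` matches one of those groups, without adding the `n` column.

```r
library(dplyr)

# Hypothetical toy data: group 1 has 2 rows, group 2 has 5 rows, group 3 has 3 rows
toy <- data.frame(x = seq_len(10), cat = rep(1:3, times = c(2, 5, 3)))

# One row per group, with the group size in column `n`
sizes <- count(toy, cat)

# Keep only the groups whose size is exactly 5
keep <- filter(sizes, n == 5)

# Keep the rows of `toy` whose `cat` appears in `keep`; the columns of `toy` are unchanged
res <- semi_join(toy, keep, by = "cat")
```

Because `semi_join()` matches on the value of `cat`, this approach should not depend on the rows being sorted by group.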

Here's another approach, based on the OP's original approach with a small modification:

n <- 1e5
x <- rnorm(n)
# Category size ranging each from 1 to 5
cat <- rep(seq_len(n/3), sample(1:5, n/3, replace = TRUE))[1:n]

dat <- data.frame(x = x, cat = cat)

# second data set for the dt approach
dat2 <- data.frame(x = x, cat = cat)

sol_floo0 <- function(dat){
  dat <- group_by(dat, cat)
  all_ind <- rep(seq_len(n_groups(dat)), group_size(dat))
  take_only <- which(group_size(dat) == 5L)
  dat[all_ind %in% take_only, ]
}

sol_floo0_v2 <- function(dat){
  g <- group_by(dat, cat) %>% group_size()
  ind <- rep(g == 5, g)
  dat[ind, ]
}



microbenchmark::microbenchmark(times = 10,
                               sol_floo0(dat),
                               sol_floo0_v2(dat2))
#Unit: milliseconds
#               expr      min       lq     mean   median       uq      max neval cld
#     sol_floo0(dat) 43.72903 44.89957 45.71121 45.10773 46.59019 48.64595    10   b
# sol_floo0_v2(dat2) 29.83724 30.56719 32.92777 31.97169 34.10451 38.31037    10  a 
all.equal(sol_floo0(dat), sol_floo0_v2(dat2))
#[1] TRUE
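
Both functions build the row index with `rep()` over the group sizes, which appears to rely on the rows already being sorted by `cat` (true for the simulated data, since `cat` is generated with `rep(seq_len(...))`). The sketch below uses a made-up toy data frame to show what the index vector looks like, and adds an order-independent base R variant using `ave()`. That variant is a swapped-in technique for illustration, not part of the original answer, and it has not been benchmarked here.

```r
library(dplyr)

# Hypothetical toy data, sorted by `cat`: group sizes 2, 5, 3
toy <- data.frame(x = seq_len(10), cat = rep(1:3, times = c(2, 5, 3)))

# Group sizes in group order
g <- group_by(toy, cat) %>% group_size()

# Repeat each group's TRUE/FALSE flag once per row of that group
ind <- rep(g == 5, g)
res_sorted <- toy[ind, ]

# Order-independent variant: ave() returns, for every row, the size of its group
shuffled <- toy[c(10, 3, 1, 7, 4, 9, 5, 2, 6, 8), ]
grp_size <- ave(seq_along(shuffled$cat), shuffled$cat, FUN = length)
res_any_order <- shuffled[grp_size == 5, ]
```

If the input may arrive unsorted, either sort by `cat` first or use a variant that computes a per-row group size, as in the second half of the sketch.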
