Choose groups to keep/drop in data.table
Question
How can I drop/keep groups according to a condition in data.table? Is there a better method than adding a new column, then filtering on that column and removing it?
library(data.table)
set.seed(0)
dt <- data.table(a = rep(1:3, each = 3), b = sample(1:5, 9, replace = TRUE))
#    a b
# 1: 1 4
# 2: 1 1
# 3: 1 2
# 4: 2 1
# 5: 2 4
# 6: 2 2
# 7: 3 4
# 8: 3 3
# 9: 3 4
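# data.table: add a helper column, filter on it, then remove it
# (:= also leaves the keep column on dt itself; only the filtered copy loses it)
dt[, keep := 2 %in% b, by = a][keep == TRUE][, keep := NULL][]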
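# Output (the b values differ from the printout above, presumably from a different
# random draw; in both, groups 1 and 2 contain a 2 and group 3 does not)
#    a b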
# 1: 1 5
# 2: 1 2
# 3: 1 2
# 4: 2 3
# 5: 2 5
# 6: 2 2
# dplyr
library(dplyr)
dt %>%
  group_by(a) %>%
  filter(2 %in% b)
# # A tibble: 6 x 2
# # Groups:   a [2]
#       a     b
#   <int> <int>
# 1     1     5
# 2     1     2
# 3     1     2
# 4     2     3
# 5     2     5
# 6     2     2
Benchmark to see whether the .I approach is faster (run on a 2015 MacBook Pro):
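library(data.table)
library(purrr)           # map()
library(microbenchmark)

bench <- map(10^(4:7), ~ {
  # note: df is created here but is not used by the benchmark below
  df <- data.table(name = sample(1:.x, 3 * .x, TRUE),
                   a = runif(3 * .x),
                   b = runif(3 * .x),
                   c = runif(3 * .x))
  # .x groups of 10 rows each
  dt <- data.table(a = rep(1:.x, each = 10), b = sample(1:10, 10 * .x, TRUE))
  microbenchmark(dt[, if (2 %in% b) .SD, a],
                 dt[dt[, .I[2 %in% b], a]$V1])
})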
bench
[[1]]
Unit: milliseconds
                         expr      min       lq     mean   median       uq       max neval
   dt[, if (2 %in% b) .SD, a] 13.04827 17.36046 21.15155 19.19119 22.94641  43.04519   100
 dt[dt[, .I[2 %in% b], a]$V1] 17.32547 22.92023 27.09775 24.87586 28.39789 108.47604   100
[[2]]
Unit: milliseconds
                         expr      min       lq     mean   median       uq      max neval
   dt[, if (2 %in% b) .SD, a] 123.9118 143.7802 162.6719 154.4713 173.2986 428.4141   100
 dt[dt[, .I[2 %in% b], a]$V1] 158.2975 177.3303 206.3611 193.4460 224.5091 435.3982   100
[[3]]
Unit: seconds
                         expr     min       lq     mean   median       uq      max neval
   dt[, if (2 %in% b) .SD, a] 1.23310 1.351067 1.448680 1.402827 1.517017 1.852797   100
 dt[dt[, .I[2 %in% b], a]$V1] 1.58702 1.704344 1.826468 1.778590 1.947943 2.243176   100
[[4]]
Unit: seconds
                         expr      min       lq     mean   median       uq      max neval
   dt[, if (2 %in% b) .SD, a] 14.44317 14.65889 14.79806 14.78217 14.91571 15.29134   100
 dt[dt[, .I[2 %in% b], a]$V1] 18.04774 18.36764 18.48804 18.45732 18.53073 20.73805   100
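Before comparing timings, it may help to confirm that the two benchmarked expressions select the same rows. The following is a minimal sketch on a small hand-written table (illustrative data, not the random draw used above); the names idx, via_I and via_SD are made up for this example.

```r
library(data.table)

# Small hand-written table (illustrative values only)
dt <- data.table(a = c(1L, 1L, 2L, 2L, 3L, 3L),
                 b = c(5L, 2L, 3L, 2L, 4L, 4L))

# .I holds the row numbers of the current group. Indexing it with a single
# TRUE keeps all of them; indexing with a single FALSE keeps none.
idx <- dt[, .I[2 %in% b], by = a]$V1

via_I  <- dt[idx]                          # subset dt by the collected row numbers
via_SD <- dt[, if (2 %in% b) .SD, by = a]  # return .SD only for matching groups

# Both approaches should yield the same rows (groups 1 and 2 here)
stopifnot(isTRUE(all.equal(via_I, via_SD)))
```

The .I variant builds a vector of row numbers first and subsets once, while the .SD variant materialises each kept group; which one is faster in practice is what the benchmark above measures.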
Recommended answer
You can use an if condition with .SD after grouping dt by column a:
dt[, if(2 %in% b) .SD, a]
#   a b
#1: 1 5
#2: 1 2
#3: 1 2
#4: 2 3
#5: 2 5
#6: 2 2
From ?.SD, .SD is a data.table containing the Subset of x's Data for each group. Combined with the if condition, the expression returns nothing (NULL) for any group where 2 is not in column b, so that group is dropped from the result.
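Since the question asks about both keeping and dropping groups, here is a small sketch of the two directions using the same if/.SD pattern. The table values and the names kept and dropped are assumptions for illustration; negating the condition is one straightforward way to drop instead of keep.

```r
library(data.table)

# Hand-written example table (illustrative values only)
dt <- data.table(a = c(1L, 1L, 2L, 2L, 3L, 3L),
                 b = c(5L, 2L, 3L, 2L, 4L, 4L))

# Keep only the groups whose b contains a 2
kept <- dt[, if (2 %in% b) .SD, by = a]

# Drop those groups instead by negating the condition
dropped <- dt[, if (!(2 %in% b)) .SD, by = a]

# No helper column is added, so dt keeps its original columns
stopifnot(identical(names(dt), c("a", "b")))

# Every row of dt ends up in exactly one of the two results
stopifnot(nrow(kept) + nrow(dropped) == nrow(dt))
```

With this data, kept contains groups 1 and 2 and dropped contains group 3. Unlike the helper-column approach in the question, nothing is assigned by reference, so dt is left unchanged.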