Select one row per group with ifelse in data.table
Problem description
I'm grouping a data.table and want to select from each group the first row where x == 1 or, if no such row exists, the first row of the group regardless of its value of x.
library(data.table)
d <- data.table(
a = c(1,1,1, 2,2, 3,3),
x = c(0,1,0, 0,0, 1,1),
y = c(1,2,3, 1,2, 1,2)
)
This attempt
d[, ifelse(any(.SD[,x] == 1),.SD[x == 1][1], .SD[1]), by = a]
returns
a V1
1: 1 1
2: 2 0
3: 3 1
but I expected
a x y
1: 1 1 2
2: 2 0 1
3: 3 1 1
Any ideas on how to do this correctly?
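Recommended answer
We can do this with .I, which returns the row index, and then use that index to subset the rows. Within each group, which.max(x == 1) gives the position of the first TRUE, or 1 when no element of x equals 1, so both cases from the question are covered.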
d[d[, .I[which.max(x==1)], by = a]$V1]
# a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1
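To see why this works, it can help to look at the intermediate index. The sketch below reuses d from the question; idx, res_I and res_if are names introduced here for illustration only. The last statement shows one way the original attempt might be repaired, on the assumption that the problem was ifelse() returning a result shaped like its length-one test: a plain if/else returns the whole row instead.

```r
library(data.table)

d <- data.table(
  a = c(1,1,1, 2,2, 3,3),
  x = c(0,1,0, 0,0, 1,1),
  y = c(1,2,3, 1,2, 1,2)
)

# Step 1: per group, pick the global row number (.I) of the first row
# with x == 1; which.max() falls back to position 1 if there is none.
idx <- d[, .I[which.max(x == 1)], by = a]$V1
idx
# [1] 2 4 6

# Step 2: subset the original table once with those row numbers.
res_I <- d[idx]

# Illustrative repair of the question's attempt: if/else instead of ifelse(),
# so each group contributes a full one-row data.table.
res_if <- d[, if (any(x == 1)) .SD[x == 1][1] else .SD[1], by = a]
```

Both res_I and res_if should contain the three rows the question expected; the .I version avoids building a .SD subset for every group.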
In current versions of data.table, the .I approach is more efficient than .SD for subsetting rows (though this could change in the future). A similar post covers this as well.
Here is another option with order (setkey can also be used, for efficiency): order the dataset by 'a' and by descending 'x', group by 'a', and then get the first row of each group with head.
d[order(a ,-x), head(.SD, 1) ,by = a]
# a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1
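A related sketch, not part of the original answer: after the same ordering, unique() with by = "a" keeps the first row of each group, which avoids .SD altogether. It assumes the ordering is stable, so that rows tied on x keep their original relative order. The names res_unique and res_head are introduced here for illustration.

```r
library(data.table)

d <- data.table(
  a = c(1,1,1, 2,2, 3,3),
  x = c(0,1,0, 0,0, 1,1),
  y = c(1,2,3, 1,2, 1,2)
)

# Sort by group, then by descending x so any x == 1 row comes first
# within its group; unique() then keeps the first row per value of 'a'.
res_unique <- unique(d[order(a, -x)], by = "a")

# The head(.SD, 1) version from the answer, for comparison.
res_head <- d[order(a, -x), head(.SD, 1), by = a]
```

On the example data both objects should hold the same three rows.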
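Benchmarks
Initially, we thought about benchmarking on > 1e6 rows, but the .SD methods take a long time, so the comparison below uses 3e5 rows (1e5 groups) with data.table_1.9.7.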
set.seed(24)
d1 <- data.table(a = rep(1:1e5, 3), x = sample(0:1, 1e5*3,
replace=TRUE), y = rnorm(1e5*3))
system.time(d1[, .SD[which.max(x == 1)], by = a])
# user system elapsed
# 56.21 30.64 86.42
system.time(d1[, .SD[match(1L, x, nomatch = 1L)], by = a])
# user system elapsed
# 55.27 30.07 83.75
system.time(d1[d1[, .I[which.max(x==1)], by = a]$V1])
# user system elapsed
# 0.19 0.00 0.19
system.time(d1[order(a ,-x), head(.SD, 1) ,by = a])
# user system elapsed
# 0.03 0.00 0.04