Subset by group with data.table
Assume I have a data table containing some baseball players:
library(plyr)
library(data.table)
bdt <- as.data.table(baseball)
For each player (given by id), I want to find the row corresponding to the year in which they played the most games. This is straightforward in plyr:
ddply(baseball, "id", subset, g == max(g))
What's the equivalent code for data.table?
I tried:
setkey(bdt, "id")
bdt[g == max(g)] # only one row
bdt[g == max(g), by = id] # Error: 'by' or 'keyby' is supplied but not j
bdt[, .SD[g == max(g)]] # only one row
This works:
bdt[, .SD[g == max(g)], by = id]
But it's only 30% faster than plyr, which suggests it's probably not idiomatic.
Here's the fast data.table way:
bdt[bdt[, .I[g == max(g)], by = id]$V1]
This avoids constructing .SD, which is the bottleneck in your expressions.
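To make the two steps of that one-liner visible, here is a minimal sketch on a small made-up table. The table `toy` and its values are hypothetical stand-ins, not taken from the baseball data. The inner call collects, per group, the row numbers of the full table where g equals the group maximum; the outer call then subsets the table once with those row numbers.

```r
library(data.table)

# Hypothetical toy data standing in for the baseball table
toy <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2002, 2003, 2001, 2002),
  g    = c(10, 30, 20, 5, 5)
)

# Step 1: inside a grouped j, .I holds the row numbers (in the full table)
# of the current group. Indexing it with g == max(g) keeps only the rows
# that reach the group maximum. The unnamed result column is called V1.
idx <- toy[, .I[g == max(g)], by = id]
idx$V1   # 2 4 5

# Step 2: subset the original table once, using the collected row numbers
res <- toy[idx$V1]
res
```

Note that player "b" has two seasons tied at the maximum, so both rows are kept, which matches what the subset(g == max(g)) condition does.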
edit: Actually, the main reason the OP's version is slow is not just that it has .SD in it, but that it uses .SD in a particular way: by calling [.data.table on it, which at the moment has a huge overhead, so running it in a loop (which is what happens when one does a by) accumulates a very large penalty.
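One way to check this on your own machine is sketched below. It builds a synthetic table (`big`, with made-up ids, years, and game counts, all hypothetical), runs both forms, and confirms they return the same rows. The timings are deliberately not quoted: they depend on the machine and on the data.table version, and later releases may have reduced the per-group overhead described above.

```r
library(data.table)

set.seed(1)
n_id <- 2000  # number of hypothetical players

# Synthetic stand-in for the baseball data: 10 seasons per player
big <- data.table(
  id   = rep(seq_len(n_id), each = 10),
  year = rep(1:10, times = n_id),
  g    = sample(0:162, n_id * 10, replace = TRUE)
)

# Form from the question: calls [.data.table on .SD once per group
t_slow <- system.time(
  slow <- big[, .SD[g == max(g)], by = id]
)

# Form from the answer: collects row numbers per group, then subsets once
t_fast <- system.time(
  fast <- big[big[, .I[g == max(g)], by = id]$V1]
)

# Both forms should select the same rows in the same order
all.equal(slow, fast, check.attributes = FALSE)

# Compare elapsed times; results vary by machine and data.table version
rbind(slow = t_slow, fast = t_fast)[, "elapsed"]
```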