Subset by group with data.table


Question

Assume I have a data table containing some baseball players:

library(plyr)
library(data.table)

bdt <- as.data.table(baseball)

For each player (given by id), I want to find the row corresponding to the year in which they played the most games. This is straightforward in plyr:

ddply(baseball, "id", subset, g == max(g))

What's the equivalent code for data.table?

I tried:

setkey(bdt, "id") 
bdt[g == max(g)]  # only one row
bdt[g == max(g), by = id]  # Error: 'by' or 'keyby' is supplied but not j
bdt[, .SD[g == max(g)]] # only one row

This works:

bdt[, .SD[g == max(g)], by = id] 

But it's only 30% faster than plyr, suggesting it's probably not idiomatic.

Solution

Here's the fast data.table way:

bdt[bdt[, .I[g == max(g)], by = id]$V1]

This avoids constructing .SD, which is the bottleneck in your expressions.
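To make the two steps of that one-liner visible, here is a minimal sketch on a small made-up table (the table `dt` and its values are hypothetical, standing in for `baseball`). The inner call returns, for each id, the row numbers in the original table where g equals the group maximum; the outer call then subsets the table once with those row numbers.

```r
library(data.table)

# Hypothetical toy data standing in for the baseball table
dt <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2002, 2003, 2001, 2002),
  g    = c(10, 30, 20, 5, 5)
)

# Step 1: per group, collect the row numbers (positions in dt)
# where g equals that group's maximum. The unnamed result column is V1.
idx <- dt[, .I[g == max(g)], by = id]

# Step 2: subset the original table once, using those row numbers
res <- dt[idx$V1]
print(res)
```

Note that ties are kept: player "b" has two seasons with the same maximum, so both rows are returned, which matches what the `subset` call in plyr does.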

Edit: Actually, the main reason the OP's code is slow is not just that it uses .SD, but that it uses it in a particular way: by calling [.data.table inside each group. That call currently has a large overhead, so running it in a loop (which is what happens when one does a by) accumulates a very large penalty.
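A rough way to check this on your own machine is to time both forms on synthetic data with many groups. The sketch below uses made-up data (the sizes and column values are assumptions, not the baseball dataset), and the size of the gap will depend on the data.table version and the number of groups, so treat the timings as indicative only.

```r
library(data.table)

set.seed(1)
n_ids <- 2000L  # hypothetical number of players; many groups exposes per-group overhead

# Synthetic stand-in for the baseball table: 10 seasons per player
dt <- data.table(
  id   = rep(seq_len(n_ids), each = 10L),
  year = rep(2001:2010, times = n_ids),
  g    = sample(162L, n_ids * 10L, replace = TRUE)
)

# Form from the question: subsets .SD once per group
t_sd <- system.time(slow <- dt[, .SD[g == max(g)], by = id])

# Form from the answer: collects row numbers per group, then subsets once
t_i <- system.time(fast <- dt[dt[, .I[g == max(g)], by = id]$V1])

# Elapsed seconds for each form
print(c(SD = t_sd[["elapsed"]], I = t_i[["elapsed"]]))

# Both forms should select the same rows
stopifnot(nrow(slow) == nrow(fast))
```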
