R: data.table vs merge(aggregate()) performance
Question
Or to be more general, it's DT[, .SD[...], by = ...] versus merge(aggregate(...)).
Without further ado, here's the data and an example:

library(data.table)

set.seed(5141)
size = 1e6
df <- data.table(a = rnorm(size),
                 b = paste0(sample(letters, size, T),
                            sample(letters, size, T),
                            sample(letters, size, T)),
                 c = sample(1:(size/10), size, T),
                 d = sample(seq.Date(as.Date("2015-01-01"),
                                     as.Date("2015-05-31"), by = "day"),
                            size, T))

system.time(df[, .SD[d == max(d)], by = c])
#    user  system elapsed
#   50.89    0.00   51.00

system.time(merge(aggregate(d ~ c, data = df, max), df))
#    user  system elapsed
#   18.24    0.20   18.45
Usually having no problems with data.table performance, I was surprised by this particular example. I had to subset (aggregate) a fairly large data frame by taking only the latest (possibly simultaneous) occurrences of some event types, and keep the rest of the relevant data for those particular events. However, .SD doesn't seem to scale well in this particular application.

Is there a better "data.table way" to tackle this kind of task?
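For concreteness, here is a toy illustration of the task (an added example, not from the original question); group 2 has two simultaneous latest events, and both rows are kept:

toy <- data.table(c = c(1, 1, 2, 2, 2),
                  d = as.Date(c("2015-01-01", "2015-01-03",
                                "2015-02-01", "2015-02-10", "2015-02-10")),
                  a = 1:5)
toy[, .SD[d == max(d)], by = c]
#    c          d a
# 1: 1 2015-01-03 2
# 2: 2 2015-02-10 4
# 3: 2 2015-02-10 5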
Solution

We can use .I to get the row indices and subset the rows based on them. It should be faster.

system.time(df[df[, .I[d == max(d)], by = c]$V1])
#    user  system elapsed
#    5.00    0.09    5.30
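As a quick sanity check (a sketch added here; it assumes data.table >= 1.9.8 for fsetequal), the .I approach selects the same rows as the .SD call; only the column order differs, because by = moves the grouping column to the front:

res_sd <- df[, .SD[d == max(d)], by = c]
res_i  <- df[df[, .I[d == max(d)], by = c]$V1]
# align the column order, then compare; fsetequal ignores row order
fsetequal(res_sd, setcolorder(res_i, names(res_sd)))
# expected: TRUE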
@Heroka's solution:

system.time(df[, is_max := d == max(d), by = c][is_max == TRUE])
#    user  system elapsed
#    5.06    0.00    5.12
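One caveat worth flagging (my addition, not from @Heroka's answer): := creates the is_max column on df itself by reference, so drop it once you have the subset:

res <- df[is_max == TRUE][, is_max := NULL]  # the subset, without the helper column
df[, is_max := NULL]                         # remove the flag from df as well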
The aggregate method on my machine gives

system.time(merge(aggregate(d ~ c, data = df, max), df))
#    user  system elapsed
#   48.62    1.00   50.76
With the .SD option:

system.time(df[, .SD[d == max(d)], by = c])
#    user  system elapsed
#  151.13    0.40  156.57
Using a data.table join:

system.time(df[df[, list(d = max(d)), c], on = c('c', 'd')])
#    user  system elapsed
#    0.58    0.01    0.60
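A minimal keyed variant of the same join (my own sketch, not part of the original answer; note that setkey re-sorts df by reference):

setkey(df, c, d)                    # sorts df and marks (c, d) as the key
df[df[, list(d = max(d)), by = c]]  # i's columns are matched to the key columns

Keying pays off when the table is joined or filtered on the same columns repeatedly, since the sort is done once.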
If we look at the comparison between the merge/aggregate approach and the == approach, they are different functions. Usually, the aggregate/merge method will be slower than the corresponding data.table join. But here we are instead using ==, which compares every row (a vector scan, which takes some time), along with .SD for subsetting (which is also relatively less efficient than .I for row indexing). .SD additionally carries the overhead of [.data.table for each group.
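To illustrate the vector-scan versus join distinction described above, here is a sketch with hypothetical lookup values (42L and "2015-05-30" are arbitrary):

# vector scan: == builds a full-length logical vector, touching every row
df[c == 42L & d == as.Date("2015-05-30")]

# join form: binary search on the (c, d) columns instead of a full scan
df[.(42L, as.Date("2015-05-30")), on = c('c', 'd'), nomatch = 0L]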