R: data.table vs merge(aggregate()) performance


Problem description


Or to be more general, it's DT[,.SD[...],by=...] versus merge(aggregate(...)).

Without further ado, here's the data and an example:

library(data.table)

set.seed(5141)
size = 1e6
df <- data.table(a = rnorm(size),
                 b = paste0(sample(letters, size, T), 
                            sample(letters, size, T), 
                            sample(letters, size, T)),
                 c = sample(1:(size/10), size, T),
                 d = sample(seq.Date(as.Date("2015-01-01"), 
                                     as.Date("2015-05-31"), by="day"), size, T))

system.time(df[,.SD[d == max(d)], by = c])
# user  system elapsed 
# 50.89    0.00   51.00 
system.time(merge(aggregate(d ~ c, data = df, max), df))
# user  system elapsed 
# 18.24    0.20   18.45 

I usually have no problems with data.table performance, so this particular example surprised me. I had to subset (aggregate) a fairly large data frame by taking only the latest (possibly simultaneous) occurrences of some event types, while keeping the rest of the relevant data for those particular events. However, .SD doesn't seem to scale well in this particular application.

Is there a better "data table way" to tackle this kind of task?

Solution

We can use .I to get the row index and subset the rows based on that. It should be faster.

# .I returns the row indices for each group (in a column named V1 by default),
# which we use to subset df directly instead of materializing .SD per group
system.time(df[df[, .I[d == max(d)], by = c]$V1])
#    user  system elapsed 
#   5.00    0.09    5.30 

@Heroka's solution

system.time(df[,is_max:=d==max(d), by = c][is_max==T,])
#   user  system elapsed 
#  5.06    0.00    5.12 
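
One side effect worth noting (not part of the original answer): := adds the is_max flag to df by reference, and the flag is carried into the filtered result. A minimal cleanup sketch, assuming the df built above:

# keep the filtered rows, then drop the helper flag from the result
res <- df[is_max == TRUE][, is_max := NULL]
# drop the flag from the original table too, since := modified df in place
df[, is_max := NULL]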

The aggregate method on my machine gives

system.time(merge(aggregate(d ~ c, data = df, max), df))
#   user  system elapsed 
#  48.62    1.00   50.76 

With the .SD option

system.time(df[,.SD[d == max(d)], by = c])
#   user  system elapsed 
# 151.13    0.40  156.57 

Using the data.table join

system.time(df[df[, list(d=max(d)) , c], on=c('c', 'd')])
#   user  system elapsed 
#   0.58    0.01    0.60 
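
To make the join easier to follow, the same call can be split into two steps (a sketch using the df from above; mx and res_join are illustrative names):

# step 1: one row per group in c, holding the group-wise maximum of d
mx <- df[, list(d = max(d)), by = c]
# step 2: join back to df on both c and d, keeping every row that ties the maximum
res_join <- df[mx, on = c("c", "d")]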


If we compare the merge/aggregate approach with the == approaches, we are comparing different functions. Usually, the aggregate/merge method will be slower than the corresponding data.table join. But here we are also using ==, which compares every row (and takes some time), along with .SD for subsetting, which is likewise less efficient than .I for row indexing. On top of that, .SD carries the overhead of [.data.table.
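
As a sanity check (not part of the original answer), the two fast variants can be compared for row-level equality; fsetequal() from data.table treats two tables as sets of rows, ignoring row order:

res_I    <- df[df[, .I[d == max(d)], by = c]$V1]
res_join <- df[df[, list(d = max(d)), by = c], on = c("c", "d")]
# both approaches should return exactly the same rows
fsetequal(res_I, res_join)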
