How to speed up subset by groups


Problem Description

I used to do my data wrangling with dplyr, but some of the computations are slow. In particular, subsetting by groups: I read that dplyr is slow when there are many groups, and based on this benchmark data.table could be faster, so I started to learn data.table.

Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2 and subset the rows with max(datetime) for each group.

Data

library(dplyr)
library(data.table)
library(microbenchmark)

# random datetime generation function by Dirk Eddelbuettel
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/05") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et,st,unit="sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}

set.seed(42)
# Creating 230000 ids couples
ids <- data.frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"), 
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Repeating randomly the ids[1:2000, ] to create groups
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
# Adding random datetime variable and dummy variables to reproduce real datas
datas <- transform(ids, 
                   datetime = rand.datetime(25e4), 
                   var1 = sample(LETTERS[1:6], 25e4, rep = TRUE), 
                   var2 = sample(c(1:10, NA), 25e4, rep = TRUE), 
                   var3 = sample(c(1:10, NA), 25e4, rep = TRUE), 
                   var4 = rand.datetime(25e4), 
                   var5 = rand.datetime(25e4))

datas.tbl <- tbl_df(datas)
datas.dt <- data.table(datas, key = c("id1", "id2"))
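
A quick sanity check on the sizes (a sketch, run after the block above):

nrow(datas)                                # 250,000 rows
nrow(unique(datas[, c("id1", "id2")]))     # about 230,000 distinct id1/id2 groups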

I couldn't find a straightforward way to subset by groups with data.table, so I asked this question: Filter rows by groups with data.table

It was suggested that I use .SD:

datas.dt[, .SD[datetime == max(datetime)], by = c("id1", "id2")]

But I have two problems: it works with Date but not with POSIXct ("Error in UseMethod("as.data.table") : no applicable method for 'as.data.table' applied to an object of class "c('POSIXct', 'POSIXt')""), and it is very slow. For example, with Dates:

> system.time({
+   datas.dt[, .SD[as.Date(datetime) == max(as.Date(datetime))], by = c("id1", "id2")]
+ })
 utilisateur     système      écoulé 
      207.03        0.00      207.48 

So I found another, much faster way to achieve this (while keeping the datetimes) with data.table:

Functions

f.dplyr <- function(x) x %>% group_by(id1, id2) %>% filter(datetime == max(datetime))
f.dt.i <- function(x) x[x[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
f.dt <- function(x) x[x[, datetime == max(datetime), by = c("id1", "id2")]$V1]

But I thought data.table would be much faster; the time difference with dplyr isn't significant.

Microbenchmark

mbm <- microbenchmark(
  dplyr = res1 <- f.dplyr(datas.tbl), 
  data.table.I = res2 <- f.dt.i(datas.dt), 
  data.table = res3 <- f.dt(datas.dt), 
  times = 50L)

Unit: seconds
         expr      min       lq     mean   median       uq      max neval
        dplyr 31.84249 32.24055 32.59046 32.61311 32.88703 33.54226    50
 data.table.I 30.02831 30.94621 31.19660 31.17820 31.42888 32.16521    50
   data.table 30.28923 30.84212 31.09749 31.04851 31.40432 31.96351    50
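
Optionally, the spread of the timings can be visualised; microbenchmark provides an autoplot method built on ggplot2 (which is attached per the sessionInfo below):

library(ggplot2)
autoplot(mbm)   # distribution of the 50 timings for each expression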

Am I missing/misusing something with data.table? Do you have any ideas to speed up this computation?

Any help would be highly appreciated! Thanks.


Edit: some details about the system and package versions used for the microbenchmark. (The computer is no powerhouse: i5 with 12 GB of RAM.)

System

sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
  [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] readr_0.1.0          ggplot2_1.0.1        microbenchmark_1.4-2
[4] data.table_1.9.4     dplyr_0.4.1          plyr_1.8.2          

loaded via a namespace (and not attached):
  [1] assertthat_0.1   chron_2.3-45     colorspace_1.2-6 DBI_0.3.1       
[5] digest_0.6.8     grid_3.1.3       gtable_0.1.2     lazyeval_0.1.10 
[9] magrittr_1.5     MASS_7.3-39      munsell_0.4.2    parallel_3.1.3  
[13] proto_0.3-10     Rcpp_0.11.5      reshape2_1.4.1   scales_0.2.4    
[17] stringi_0.4-1    stringr_0.6.2    tools_3.1.3 

> packageVersion("data.table")
[1] ‘1.9.4’
> packageVersion("dplyr")
[1] ‘0.4.1’

Solution

Great question!

I'll assume df and dt to be the names of objects for easy/quick typing.

df = datas.tbl
dt = datas.dt

Comparison at -O3 level optimisation:

First, here are the timings on my system with the current CRAN version of dplyr and the devel version of data.table. The devel version of dplyr seems to suffer from a performance regression (which is being fixed by Romain).

system.time(df %>% group_by(id1, id2) %>% filter(datetime == max(datetime)))
#  25.291   0.128  25.610 

system.time(dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
#  17.191   0.075  17.349 

I ran this quite a few times, and the timings didn't seem to change. However, I compile all packages with the -O3 optimisation flag (by setting ~/.R/Makevars appropriately). And I've observed that data.table's performance gets much better than that of other packages I've compared it with at -O3.
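
For reference, a minimal ~/.R/Makevars along those lines might look like the sketch below (exact flags depend on your compiler, and packages must be reinstalled from source for it to take effect):

# ~/.R/Makevars (sketch): compile package C/C++ code with -O3
CFLAGS   = -O3
CXXFLAGS = -O3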

Grouping speed comparison

Second, it is important to understand the reason for such slowness. First let's compare the time to just group.

system.time(group_by(df, id1, id2))
#   0.303   0.007   0.311 
system.time(data.table:::forderv(dt, by = c("id1", "id2"), retGrp = TRUE))
#   0.002   0.000   0.002 

Even though there are a total of 250,000 rows, your data size is only around ~38MB. At this size, it's unlikely to see a noticeable difference in grouping speed.
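
For reference, a quick way to check that in-memory size (a sketch, assuming the datas object built in the question is still around):

print(object.size(datas), units = "Mb")   # approximate in-memory footprint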

data.table's grouping is >100x faster here, so grouping is clearly not the reason for the slowness...

Why is it slow?

So what's the reason? Let's turn on datatable.verbose option and check again:

options(datatable.verbose = TRUE)
dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
# Detected that j uses these columns: datetime 
# Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as '.I[datetime == max(datetime)]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Starting dogroups ... 
#   memcpy contiguous groups took 0.097s for 230000 groups
#   eval(j) took 17.129s for 230000 calls
# done dogroups in 17.597 secs

So eval(j) alone took ~97% of the time! The expression we've provided in j is evaluated once for each group. Since you have 230,000 groups, and there's a penalty to each eval() call, that adds up.
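
As a rough illustration of that per-call penalty (plain base R, not data.table internals), evaluating even a trivial quoted expression 230,000 times is measurably costly:

# Sketch: the overhead of 230,000 eval() calls on a tiny expression,
# roughly what dogroups pays once per group on top of the actual work.
x <- rnorm(10)
e <- quote(max(x))
system.time(for (i in seq_len(230000)) eval(e))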

Avoiding the eval() penalty

Since we're aware of this penalty, we've gone ahead and started implementing internal versions of some commonly used functions: sum, mean, min, max. This will/should be expanded to as many other functions as possible (when we find time).

So, let's try computing the time for just obtaining max(datetime) first:

dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
# Detected that j uses these columns: datetime 
# Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'list(max(datetime))'
# GForce optimized j to 'list(gmax(datetime))'

And it's instant. Why? Because max() gets internally optimised to gmax() and there's no eval() call for each of the 230K groups.

So why isn't datetime == max(datetime) instant? Because it's more complicated to parse such expressions and optimise internally, and we have not gotten to it yet.

Workaround

So now that we know the issue, and a way to get around it, let's use it.

dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
dt[dt.agg, on = c("id1", "id2", "datetime")] # v1.9.5+

This takes ~0.14 seconds on my Mac.
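
If you're still on data.table 1.9.4 (the version in the question's sessionInfo), the on= argument isn't available; an ordinary keyed join is a possible workaround (a sketch; note that setkey() re-sorts dt and replaces its existing key):

# Sketch for data.table 1.9.4: key dt on the join columns, then join.
setkey(dt, id1, id2, datetime)
dt[dt.agg]   # rows of dt matching each group's max(datetime)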

Note that this is only fast because the expression gets optimised to gmax(). Compare it with:

dt[, .(datetime = base::max(datetime)), by = .(id1, id2)]
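
This form is not replaced by gmax(), so it falls back to per-group evaluation; a quick side-by-side timing sketch (same dt as above):

# Same per-group maximum both times; only the bare max() is eligible for
# GForce's gmax() replacement, so only the first call is near-instant.
system.time(dt[, .(datetime = max(datetime)),       by = .(id1, id2)])
system.time(dt[, .(datetime = base::max(datetime)), by = .(id1, id2)])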


I agree optimising more complicated expressions to avoid the eval() penalty would be the ideal solution, but we are not there yet.
