Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

Question


I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
library(microbenchmark)

set.seed(0)
dummy_data <- dplyr::data_frame(   # note: data_frame() is deprecated; tibble::tibble() is the modern equivalent
  id=floor(runif(10000, 1, 10000))
  , label=floor(runif(10000, 1, 4))
)

microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))

Without the pipe:

min       lq     mean   median       uq      max neval
1.691441 1.739436 1.841157 1.812778 1.880713 2.495853   100

With the pipe:

min       lq     mean   median       uq      max neval
6.753999 6.969573 7.167802 7.052744 7.195204 8.833322   100


Why is %>% so much slower in this situation? Is there a better way to write this?


I made the data frame smaller and incorporated Moody_Mudskipper's suggestions into the benchmarking.

microbenchmark(
  nopipe=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
  magrittr=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
  magrittr2=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
  fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label %.% unique(.) %.% list(.))
)

Unit: milliseconds
      expr       min        lq      mean    median        uq      max neval
    nopipe  59.91252  70.26554  78.10511  72.79398  79.29025 214.9245   100
  magrittr 469.09573 525.80084 568.28918 558.05634 590.48409 767.4647   100
 magrittr2  84.06716  95.20952 106.28494 100.32370 110.92373 241.1296   100
  fastpipe  93.57549 103.36926 109.94614 107.55218 111.90049 162.7763   100

Answer


What might be a negligible effect in a real-world full application becomes non-negligible when writing one-liners that are time-dependent on the formerly "negligible". I suspect if you profile your tests, most of the time will be in the summarise clause, so let's microbenchmark something similar to that:

> set.seed(99);z=sample(10000,4,TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100


This is doing something a bit different to your code but illustrates the point. Pipes are slower.


Because pipes need to restructure R's call into the one that plain function evaluation would use, and then evaluate it. So it has to be slower. By how much depends on how speedy the functions are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.
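To make "restructure the call" concrete, here is a toy pipe operator — an illustrative sketch only, not magrittr's actual implementation (the real `%>%` additionally handles `.` placeholders, multi-step chains via `split_chain`, visibility, and more):

```r
# A toy pipe: capture the right-hand side unevaluated, rebuild it into an
# ordinary call with the left-hand value as the argument, then evaluate it.
`%p%` <- function(lhs, rhs) {
  rhs_expr <- substitute(rhs)                         # e.g. the symbol `unique`
  eval(as.call(list(rhs_expr, lhs)), parent.frame())  # becomes unique(lhs)
}

x <- c(3, 1, 3, 2)
x %p% unique %p% list   # same result as list(unique(x))
```

Even this stripped-down version pays for `substitute()`, call construction, and `eval()` on every step of the chain; that per-step cost is exactly the overhead the benchmark above is measuring.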


Profiling expressions like this showed me most of the time is spent in the pipe functions:

                         total.time total.pct self.time self.pct
"microbenchmark"              16.84     98.71      1.22     7.15
"%>%"                         15.50     90.86      1.22     7.15
"eval"                         5.72     33.53      1.18     6.92
"split_chain"                  5.60     32.83      1.92    11.25
"lapply"                       5.00     29.31      0.62     3.63
"FUN"                          4.30     25.21      0.24     1.41
 ..... stuff .....


then somewhere down in about 15th place the real work gets done:

"as.list"                      1.40      8.13      0.66     3.83
"unique"                       1.38      8.01      0.88     5.11
"rev"                          1.26      7.32      0.90     5.23


Whereas if you just call the functions as Chambers intended, R gets straight down to it:

                         total.time total.pct self.time self.pct
"microbenchmark"               2.30     96.64      1.04    43.70
"unique"                       1.12     47.06      0.38    15.97
"unique.default"               0.74     31.09      0.64    26.89
"is.factor"                    0.10      4.20      0.10     4.20
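Profiles like the ones above can be reproduced with base R's sampling profiler — a sketch only; the exact numbers and row order will vary by machine and R version:

```r
# Profile the piped expression by running it many times so the
# sampling profiler (Rprof) collects enough samples.
library(magrittr)
set.seed(99); z <- sample(10000, 4, TRUE)

Rprof("pipe.out", interval = 0.01)
for (i in 1:20000) z %>% unique %>% list
Rprof(NULL)

head(summaryRprof("pipe.out")$by.total)  # pipe machinery tops the table
```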


Hence the oft-quoted recommendation that pipes are okay on the command line where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out in one call to glm with a few hundred data points, but that's another story....
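That amortization claim can be checked directly: keep the same two expressions, but make the vector large enough that `unique()` itself dominates the runtime (a sketch; timings are machine-dependent):

```r
library(magrittr)
library(microbenchmark)

set.seed(99)
z <- sample(1e6, 1e5, replace = TRUE)  # far more work per call than the length-4 z above

# The roughly fixed pipe overhead (~130 microseconds in the earlier benchmark)
# is now a small fraction of the cost of unique() on 100,000 elements.
microbenchmark(z %>% unique %>% list, list(unique(z)), times = 50)
```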
