Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?
Question
I thought that, generally speaking, using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.
library(dplyr)
library(microbenchmark)
set.seed(0)
dummy_data <- dplyr::data_frame(
id=floor(runif(10000, 1, 10000))
, label=floor(runif(10000, 1, 4))
)
microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))
Without the pipe:
min lq mean median uq max neval
1.691441 1.739436 1.841157 1.812778 1.880713 2.495853 100
With the pipe:
min lq mean median uq max neval
6.753999 6.969573 7.167802 7.052744 7.195204 8.833322 100
Why is %>% so much slower in this situation? Is there a better way to write this?
Edit:
I made the data frame smaller and incorporated Moody_Mudskipper's suggestions into the benchmarking.
microbenchmark(
nopipe=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
magrittr=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
magrittr2=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label %.% unique(.) %.% list(.))
)
Unit: milliseconds
expr min lq mean median uq max neval
nopipe 59.91252 70.26554 78.10511 72.79398 79.29025 214.9245 100
magrittr 469.09573 525.80084 568.28918 558.05634 590.48409 767.4647 100
magrittr2 84.06716 95.20952 106.28494 100.32370 110.92373 241.1296 100
fastpipe 93.57549 103.36926 109.94614 107.55218 111.90049 162.7763 100
Answer
What might be a negligible effect in a real-world full application becomes non-negligible when writing one-liners that are time-dependent on the formerly "negligible". I suspect if you profile your tests, most of the time will be spent in the summarise clause, so let's microbenchmark something similar to that:
> set.seed(99);z=sample(10000,4,TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
expr min lq mean median uq max neval
z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735 100
list(unique(z)) 9.289 9.988 10.85705 10.5820 11.804 12.642 100
This is doing something a bit different to your code, but illustrates the point: pipes are slower. And because summarise() evaluates its expression once per group, that fixed per-call overhead is paid once for every distinct id in a high-cardinality group-by.
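To see why that per-call cost matters here, a rough back-of-envelope sketch (recomputing the group count from the question's dummy_data; the ~135 microsecond figure is the piped median from the benchmark above):

```r
library(dplyr)

set.seed(0)
dummy_data <- dplyr::data_frame(
  id    = floor(runif(10000, 1, 10000)),
  label = floor(runif(10000, 1, 4))
)

# summarise() evaluates its expression once per group, so the fixed
# pipe overhead (~135 microseconds per call, per the benchmark above)
# is multiplied by the number of distinct ids.
n_groups <- length(unique(dummy_data$id))
n_groups            # thousands of distinct ids
n_groups * 135e-6   # approximate extra seconds from pipe overhead alone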
Because pipes need to restructure R's call into the one that function evaluation uses, and then evaluate it. So it has to be slower. By how much depends on how speedy the functions are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.
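One way to see this restructuring cost: magrittr's %>% is an ordinary function that dismantles and re-evaluates the call chain at run time, whereas R's native |> pipe (added in R 4.1, so newer than this post) is rewritten at parse time into a plain call and carries essentially none of that overhead:

```r
# magrittr's %>% is a runtime function call; the native |> pipe is
# pure syntax: the parser turns it into the direct call.
quote(z |> unique() |> list())
#> list(unique(z))
```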
Profiling expressions like this showed me that most of the time is spent in the pipe functions:
total.time total.pct self.time self.pct
"microbenchmark" 16.84 98.71 1.22 7.15
"%>%" 15.50 90.86 1.22 7.15
"eval" 5.72 33.53 1.18 6.92
"split_chain" 5.60 32.83 1.92 11.25
"lapply" 5.00 29.31 0.62 3.63
"FUN" 4.30 25.21 0.24 1.41
..... stuff .....
then somewhere down in about 15th place the real work gets done:
"as.list" 1.40 8.13 0.66 3.83
"unique" 1.38 8.01 0.88 5.11
"rev" 1.26 7.32 0.90 5.23
Profiling the non-piped list(unique(z)), by contrast, puts the actual work right at the top:
total.time total.pct self.time self.pct
"microbenchmark" 2.30 96.64 1.04 43.70
"unique" 1.12 47.06 0.38 15.97
"unique.default" 0.74 31.09 0.64 26.89
"is.factor" 0.10 4.20 0.10 4.20
Hence the oft-quoted recommendation that pipes are fine on the command line, where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out by one call to glm with a few hundred data points, but that's another story....
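To make that last point concrete, the same fixed overhead vanishes once each call does real work. A sketch (the vector sizes here are arbitrary, chosen only for illustration):

```r
library(magrittr)
library(microbenchmark)

# With a large vector, unique() itself dominates the runtime and the
# fixed per-call pipe overhead becomes a rounding error.
z_big <- sample(1e6, 1e5, replace = TRUE)
microbenchmark(z_big %>% unique %>% list, list(unique(z_big)))
```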