data.table vs dplyr memory use revisited
Question
I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)
However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:
First, dplyr.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DF = tibble(x = rep(1:10, times = 1e5),
            y = sample(LETTERS[1:10], 1e6, replace = TRUE),
            z = rnorm(1e6))
DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#>    y     `mean(z)`
#>  * <chr>     <dbl>
#>  1 A     -0.00336
#>  2 B     -0.00702
#>  3 C      0.00291
#>  4 D     -0.00430
#>  5 E     -0.00705
#>  6 F     -0.00568
#>  7 G     -0.00344
#>  8 H      0.000553
#>  9 I     -0.00168
#> 10 J      0.00661
bench::bench_process_memory()
#>  current      max
#>    585MB    611MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Then data.table.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 1e6, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#>     y            V1
#>  1: F -0.0056834238
#>  2: I -0.0016755202
#>  3: J  0.0066061660
#>  4: G -0.0034436348
#>  5: B -0.0070242788
#>  6: E -0.0070462070
#>  7: H  0.0005525803
#>  8: D -0.0043024627
#>  9: A -0.0033609302
#> 10: C  0.0029146372
bench::bench_process_memory()
#>  current      max
#> 948.47MB   1.17GB
Created on 2020-04-22 by the reprex package (v0.3.0)
So, basically data.table appears to be using nearly twice the memory that dplyr does for this simple filtering+grouping operation. Note that I'm essentially replicating a use-case that @Arun suggested here would be much more memory efficient on the data.table side. (data.table is still a lot faster, though.)
Any ideas, or am I just missing something obvious?
P.S. As an aside, comparing memory usage ends up being more complicated than it first seems because R's standard memory profiling tools (Rprofmem and co.) all ignore operations that occur outside R (e.g. calls to the C++ stack). Luckily, the bench package now provides a bench_process_memory()
function that also tracks memory outside of R’s GC heap, which is why I use it here.
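To make the limitation concrete, here is a minimal sketch (the temporary file name is illustrative, and it assumes an R build compiled with memory-profiling support, as CRAN binaries are):

```r
# Sketch: Rprofmem() only sees allocations made through R's own allocator.
tmp <- tempfile()
Rprofmem(tmp)
x <- numeric(1e6)   # ~8 MB of doubles, allocated on R's GC heap -> logged
Rprofmem(NULL)      # stop profiling
head(readLines(tmp))
# Each log line records one R-level allocation and its call stack.
# Memory that compiled code obtains directly via malloc()/new never
# shows up in this log, which is the gap bench_process_memory() fills.
```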
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5 tidyselect_1.0.0
#> [5] R6_2.4.1 rlang_0.4.5.9000 stringr_1.4.0 highr_0.8
#> [9] tools_3.6.3 xfun_0.13 htmltools_0.4.0 ellipsis_0.3.0
#> [13] yaml_2.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0
#> [17] crayon_1.3.4 purrr_0.3.4 vctrs_0.2.99.9011 glue_1.4.0
#> [21] evaluate_0.14 rmarkdown_2.1 stringi_1.4.6 compiler_3.6.3
#> [25] pillar_1.4.3 generics_0.0.2 pkgconfig_2.0.3
Created on 2020-04-22 by the reprex package (v0.3.0)
Answer
UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer, even with multithreading enabled, and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.
dplyr
$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB
data.table
$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB
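For reference, cgmemtime is not a packaged tool; it has to be built from source and given a one-off cgroup setup before it can wrap commands. A sketch of that setup, assuming the gsauthof/cgmemtime GitHub repository (the group name and permissions here are illustrative, and flags may differ across versions):

```shell
git clone https://github.com/gsauthof/cgmemtime.git
cd cgmemtime
make

# One-time cgroup setup so an unprivileged user can run it afterwards
sudo ./cgmemtime --setup -g "$(id -gn)" --perm 775

# Then wrap any command to capture its high-water memory usage
./cgmemtime Rscript ~/mem-comp-dt.R
```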
Bottom line: accurately measuring memory usage from within R is complicated.
I'll leave my original answer below because I think it still has value.
Original answer:
Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the latter chunk, but this time turn off multi-threading, the two results are much more comparable:
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 1e6, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#>     y            V1
#>  1: F -0.0056834238
#>  2: I -0.0016755202
#>  3: J  0.0066061660
#>  4: G -0.0034436348
#>  5: B -0.0070242788
#>  6: E -0.0070462070
#>  7: H  0.0005525803
#>  8: D -0.0043024627
#>  9: A -0.0033609302
#> 10: C  0.0029146372
bench::bench_process_memory()
#>  current      max
#>    589MB    612MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try a larger data set, despite using a single thread, which makes me suspicious that I'm still not measuring memory usage correctly...