data.table vs dplyr memory use revisited


Question

I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)

However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:

First, dplyr.

library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)

DF = tibble(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))

DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#>    y     `mean(z)`
#>  * <chr>     <dbl>
#>  1 A     -0.00336 
#>  2 B     -0.00702 
#>  3 C      0.00291 
#>  4 D     -0.00430 
#>  5 E     -0.00705 
#>  6 F     -0.00568 
#>  7 G     -0.00344 
#>  8 H      0.000553
#>  9 I     -0.00168 
#> 10 J      0.00661

bench::bench_process_memory()
#> current     max 
#>   585MB   611MB

Created on 2020-04-22 by the reprex package (v0.3.0)

Then data.table.

library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)

DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))

DT[x > 7, mean(z), by = y]
#>     y            V1
#>  1: F -0.0056834238
#>  2: I -0.0016755202
#>  3: J  0.0066061660
#>  4: G -0.0034436348
#>  5: B -0.0070242788
#>  6: E -0.0070462070
#>  7: H  0.0005525803
#>  8: D -0.0043024627
#>  9: A -0.0033609302
#> 10: C  0.0029146372

bench::bench_process_memory()
#>  current      max 
#> 948.47MB   1.17GB

Created on 2020-04-22 by the reprex package (v0.3.0)

So, basically data.table appears to be using nearly twice the memory that dplyr does for this simple filtering+grouping operation. Note that I'm essentially replicating a use-case that @Arun suggested here would be much more memory efficient on the data.table side. (data.table is still a lot faster, though.)

Any ideas, or am I just missing something obvious?

P.S. As an aside, comparing memory usage ends up being more complicated than it first seems because R's standard memory profiling tools (Rprofmem and co.) all ignore operations that occur outside R (e.g. calls to the C++ stack). Luckily, the bench package now provides a bench_process_memory() function that also tracks memory outside of R’s GC heap, which is why I use it here.
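
To make the distinction concrete, here is a minimal sketch contrasting the two measurement approaches (assuming an R build with memory profiling enabled):

Rprofmem("allocs.log")         # logs only allocations made on R's GC heap
x <- rnorm(1e6)                # this allocation is recorded in allocs.log
Rprofmem(NULL)                 # stop logging

bench::bench_process_memory()  # asks the OS for process-level usage, so it
                               # also captures memory allocated by C/C++ code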

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6      knitr_1.28        magrittr_1.5      tidyselect_1.0.0 
#>  [5] R6_2.4.1          rlang_0.4.5.9000  stringr_1.4.0     highr_0.8        
#>  [9] tools_3.6.3       xfun_0.13         htmltools_0.4.0   ellipsis_0.3.0   
#> [13] yaml_2.2.1        digest_0.6.25     tibble_3.0.1      lifecycle_0.2.0  
#> [17] crayon_1.3.4      purrr_0.3.4       vctrs_0.2.99.9011 glue_1.4.0       
#> [21] evaluate_0.14     rmarkdown_2.1     stringi_1.4.6     compiler_3.6.3   
#> [25] pillar_1.4.3      generics_0.0.2    pkgconfig_2.0.3

Created on 2020-04-22 by the reprex package (v0.3.0)

Answer

UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer — even with multithreading enabled — and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.

dplyr

$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user:    0.526 s
Child sys :    0.033 s
Child wall:    0.455 s
Child high-water RSS                    :     128952 KiB
Recursive and acc. high-water RSS+CACHE :     118516 KiB

data.table

$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user:    0.510 s
Child sys :    0.056 s
Child wall:    0.464 s
Child high-water RSS                    :     129032 KiB
Recursive and acc. high-water RSS+CACHE :     118320 KiB
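
The scripts themselves aren't shown here; each is assumed to simply wrap the corresponding pipeline from the question, so a hypothetical ~/mem-comp-dt.R would contain nothing more than:

library(data.table, warn.conflicts = FALSE)
set.seed(123)

DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))

DT[x > 7, mean(z), by = y]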

Bottom line: Accurately measuring memory use from within R is complicated.

I'll leave my original answer below because I think it still has value.

Original answer:

Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the latter chunk, but this time turn off multi-threading, the two results are much more comparable:

library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING

DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))

DT[x > 7, mean(z), by = y]
#>     y            V1
#>  1: F -0.0056834238
#>  2: I -0.0016755202
#>  3: J  0.0066061660
#>  4: G -0.0034436348
#>  5: B -0.0070242788
#>  6: E -0.0070462070
#>  7: H  0.0005525803
#>  8: D -0.0043024627
#>  9: A -0.0033609302
#> 10: C  0.0029146372

bench::bench_process_memory()
#> current     max 
#>   589MB   612MB

Created on 2020-04-22 by the reprex package (v0.3.0)

Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try with a larger data set — despite using a single thread — which makes me suspicious that I'm still not measuring memory usage correctly...
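
As a pointer for anyone who wants to reproduce that larger test, here is a minimal sketch of what I mean (the 10x scale factor and the n variable are my own illustration, not part of the original benchmark):

library(bench)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1)

n = 1e7  # 10x the original one million rows
DT = data.table(x = rep(1:10, times = n / 10),
                y = sample(LETTERS[1:10], n, replace = TRUE),
                z = rnorm(n))

DT[x > 7, mean(z), by = y]
bench::bench_process_memory()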
