Memory profiling with data.table

Question

What is the correct way to profile memory in R code that contains calls to data.table functions? Specifically, I want to determine the maximum memory usage while an expression is evaluated.

This reference suggests that Rprofmem may not be the right choice: https://cran.r-project.org/web/packages/profmem/vignettes/profmem.html

All memory allocations that are done via the native allocVector3() part of R's native API are logged, which means that nearly all memory allocations are logged. Any objects allocated this way are automatically deallocated by R's garbage collector at some point. Garbage collection events are not logged by profmem(). Allocations not logged are those done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects. Such objects are not handled by the R garbage collector.

The data.table source code contains plenty of calls to Calloc() and malloc(), which suggests that Rprofmem will not measure all memory allocated by data.table functions. If Rprofmem is not the right tool, why does Matthew Dowle use it here: R: loop over columns in data.table?
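To see what Rprofmem does and does not capture, here is a minimal sketch that sums the allocation sizes it logs while an expression runs. The helper name `logged_alloc_bytes` is hypothetical, and the parsing assumes each logged allocation line starts with its size in bytes. The result is a rough total of logged allocations, not a peak, and by the reasoning above it would miss memory obtained through Calloc() or malloc().

```r
# Hypothetical helper: total bytes that Rprofmem() logs while `expr` runs.
# Returns NA if this R build lacks memory profiling support.
logged_alloc_bytes <- function(expr) {
  if (!capabilities("profmem")) return(NA_real_)

  log_file <- tempfile(fileext = ".out")
  Rprofmem(log_file, threshold = 0)
  # `expr` is a lazy promise, so it is evaluated here, while logging is on.
  tryCatch(force(expr), finally = Rprofmem(NULL))

  lines <- readLines(log_file)
  unlink(log_file)

  # Assumed line format: "<bytes> :<call stack>". Lines such as
  # "new page:..." do not start with a number and are skipped.
  bytes <- suppressWarnings(as.numeric(sub(" .*$", "", lines)))
  sum(bytes, na.rm = TRUE)
}

# Example: a numeric vector of 1e6 doubles needs about 8 MB.
logged_alloc_bytes(numeric(1e6))
```

Comparing this figure with a process-level measurement for a data.table call is one way to check how much memory bypasses the R allocator.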

I have found a reference suggesting similar potential issues for gc() (which can be used to measure maximum memory usage between two calls to gc()): https://r.789695.n4.nabble.com/Determining-the-maximum-memory-usage-of-a-function-td4669977.html

gc() is a good start. Call gc(reset = TRUE) before and gc() after your task, and you will see the maximum extra memory used by R in the interim. (This does not include memory malloced by compiled code, which is much harder to measure as it gets re-used.)
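The gc() recipe quoted above can be sketched as a small wrapper. The helper name `peak_extra_mb` is hypothetical. It looks up the "max used" column by name because gc() can return an extra "limit (Mb)" column on some setups. As the quote warns, memory that compiled code obtains with malloc() is not included.

```r
# Hypothetical helper: extra memory (in Mb) that R's allocator used at its
# peak while evaluating `expr`, relative to the baseline before the call.
peak_extra_mb <- function(expr) {
  # Collect garbage and reset the "max used" statistics.
  before <- gc(reset = TRUE)
  # `expr` is a lazy promise, so it is evaluated here, after the reset.
  force(expr)
  after <- gc()

  # Column 2 is the Mb value for "used" (Ncells and Vcells rows).
  baseline_mb <- sum(before[, 2L])
  # The Mb value for "max used" sits right after the "max used" column.
  peak_col <- which(colnames(after) == "max used") + 1L
  sum(after[, peak_col]) - baseline_mb
}

# Example: 1e7 doubles occupy roughly 76 Mb.
peak_extra_mb(numeric(1e7))
```

For a data.table call, pass the expression directly, for example `peak_extra_mb(CJ(1:1e3, 1:1e3))` after `library(data.table)`, keeping in mind that the figure is a lower bound if the function allocates outside the R heap.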

Nothing I have found suggests that similar issues exist with Rprof(memory.profiling = TRUE). Does this mean that the Rprof approach will work for data.table even though data.table does not always use the R API to allocate memory?

If Rprof(memory.profiling = TRUE) is in fact not the right tool for the job, what is?

Would ssh.utils::mem.usage work?

Recommended answer

This is not specific to data.table. There was a recent discussion on Twitter about the same behaviour in dplyr: https://mobile.twitter.com/healthandstats/status/1182840075001819136

On Linux, you can measure the peak memory of the whole R process from outside R. GNU time with the -v flag reports the maximum resident set size:

/usr/bin/time -v Rscript -e 'library(data.table); CJ(1:1e4, 1:1e4)' |& grep resident
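A related process-level figure can also be read from inside R on Linux. The sketch below is a different technique from the time command above: it parses the VmHWM ("high water mark") field of /proc/self/status, which the kernel maintains as the peak resident set size of the current process. The helper name `peak_rss_mb` is hypothetical, and it returns NA on systems without that file.

```r
# Hypothetical helper: peak resident set size of the current R process in Mb.
# Linux only; returns NA where /proc/self/status or VmHWM is unavailable.
peak_rss_mb <- function(status_file = "/proc/self/status") {
  if (!file.exists(status_file)) return(NA_real_)

  status <- readLines(status_file, warn = FALSE)
  hwm <- grep("^VmHWM:", status, value = TRUE)
  if (length(hwm) == 0L) return(NA_real_)

  # The line looks like "VmHWM:    123456 kB"; keep the digits only.
  as.numeric(gsub("[^0-9]", "", hwm[1L])) / 1024
}

# Call this at the end of a script, after the code of interest has run.
peak_rss_mb()
```

Because this is measured by the operating system, it covers memory from malloc() in compiled code as well as R's own heap. It also includes R itself and every loaded package, so comparing against a run without the expression of interest gives a fairer number.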

There is also the interesting cgmemtime project, but it requires a bit more setup.

If you are on Windows, I suggest moving to Linux.
