Memory leak in data.table grouped assignment by reference
I'm seeing odd memory usage when using assignment by reference by group in a `data.table`. Here's a simple example to demonstrate (please excuse the triviality of the example):

    library(data.table)

    N <- 1e6
    dt <- data.table(id=round(rnorm(N)), value=rnorm(N))
    gc()
    for (i in seq(100)) {
      dt[, value := value+1, by="id"]
    }
    gc()
    tables()
which produces the following output:
    > gc()
              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  303909 16.3     597831 32.0   407500 21.8
    Vcells 2442853 18.7    3260814 24.9  2689450 20.6
    > for (i in seq(100)) {
    +   dt[, value := value+1, by="id"]
    + }
    > gc()
               used  (Mb) gc trigger  (Mb) max used  (Mb)
    Ncells   315907  16.9     597831  32.0   407500  21.8
    Vcells 59966825 457.6   73320781 559.4 69633650 531.3
    > tables()
         NAME      NROW MB COLS     KEY
    [1,] dt   1,000,000 16 id,value
    Total: 16MB
So about 440MB of used Vcells memory was added by the loop. This memory is still reported as used after removing the data.table from memory:
    > rm(dt)
    > gc()
               used  (Mb) gc trigger (Mb) max used  (Mb)
    Ncells   320888  17.2     597831   32   407500  21.8
    Vcells 57977069 442.4   77066820  588 69633650 531.3
    > tables()
    No objects of class data.table exist in .GlobalEnv
The memory leak seems to disappear when removing the by=... from the assignment:
    > gc()
              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  312955 16.8     597831 32.0   467875 25.0
    Vcells 2458890 18.8    3279586 25.1  2704448 20.7
    > for (i in seq(100)) {
    +   dt[, value := value+1]
    + }
    > gc()
              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  322698 17.3     597831 32.0   467875 25.0
    Vcells 2478772 19.0    5826337 44.5  5139567 39.3
    > tables()
         NAME      NROW MB COLS     KEY
    [1,] dt   1,000,000 16 id,value
    Total: 16MB
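To put a number on growth like this without comparing tables by eye, the loop could be wrapped in a small helper. This is only a sketch: `vcells_mb()` and `vcells_growth()` are hypothetical names, not part of data.table, and they rely only on the matrix that base R's `gc()` returns.

```r
# Hypothetical helper: Vcells "used" in Mb, read from the matrix returned by gc()
vcells_mb <- function() {
  g <- gc()
  unname(g["Vcells", 2])   # column 2 is the "(Mb)" column next to "used"
}

# Hypothetical helper: call f() `times` times and report the change in Vcells (Mb)
vcells_growth <- function(f, times = 100) {
  before <- vcells_mb()
  for (i in seq_len(times)) f()
  vcells_mb() - before
}

# Example with a function that retains nothing between calls
growth <- vcells_growth(function() invisible(sum(rnorm(1e4))))
print(growth)
```

For the case in the question, the function passed in would be something like `function() dt[, value := value+1, by="id"]`.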
To summarize, two questions:
- Am I missing something or is there a memory leak?
- If there is indeed a memory leak, can anyone suggest a workaround that lets me use assignment by reference by group without the memory leak?
For reference, here's the output of `sessionInfo()`:

    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
     [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] data.table_1.8.10

    loaded via a namespace (and not attached):
    [1] tools_3.0.2
Solution

UPDATE from Matt: now fixed in v1.8.11. From NEWS:
Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered, #2648. Test added.
Many thanks to vc273, Y T and others.
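Since the fix is tied to a version number, a script that loops over grouped calls could check the version before relying on it. A minimal sketch, assuming only base R's `package_version()`; `has_grouping_fix()` is a made-up helper name, not a data.table function:

```r
# Hypothetical helper: TRUE if a version string is at or past the fixed version
has_grouping_fix <- function(version) {
  package_version(version) >= package_version("1.8.11")
}

has_grouping_fix("1.8.10")   # FALSE: the version used in the question
has_grouping_fix("1.8.11")   # TRUE

# With data.table installed, the installed version could be checked like this:
# has_grouping_fix(as.character(packageVersion("data.table")))
```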
From Arun ...
Why was this happening?
I wish I had come across this post before sitting on this issue. Nevertheless, a nice learning experience. Simon Urbanek summarises the issue pretty succinctly, that it's not a memory leak but bad-reporting of memory used/freed. I had the feeling this is what was happening.
What's the reason for this to happen in `data.table`? This part is on identifying the portion of code from `dogroups.c` responsible for the apparent memory increase.

Okay, so after some tedious testing, I think I've managed to at least find what the reason is for this to happen. Hopefully someone can help me get there from this post. My conclusion is that this is not a memory leak.
The short explanation is that this seems to be an effect of the usage of the `SETLENGTH` function (from R's C-interface) in data.table's `dogroups.c`.

In `data.table`, when you use `by=...`, for example,

    set.seed(45)
    DT <- data.table(x=sample(3, 12, TRUE), id=rep(3:1, c(2,4,6)))
    DT[, list(y=mean(x)), by=id]
Corresponding to `id=1`, the values of `x` (`=c(1,2,1,1,2,3)`) have to be picked. This means having to allocate memory for `.SD` (all columns not in `by`) per `by` value.

To overcome this allocation for each group in `by`, `data.table` accomplishes this cleverly by first allocating `.SD` with the length of the largest group in `by` (which here corresponds to `id=1`, length 6). Then, for each value of `id`, we can re-use the (overly) allocated data.table and, using the function `SETLENGTH`, just adjust the length to the length of the current group. Note that, by doing this, no memory is actually being allocated here, except the one allocation for the biggest group.

But what seems strange is that when all groups in `by` have the same number of items, nothing special seems to happen with regard to the `gc()` output. However, when they aren't the same, `gc()` seems to report increasing usage in Vcells. This is in spite of the fact that no extra memory is being allocated in either case.

To illustrate this point, I've written C code that mimics the `SETLENGTH` usage in `dogroups.c` in `data.table`.

    // test.c
    #include <R.h>
    #define USE_RINTERNALS
    #include <Rinternals.h>
    #include <Rdefines.h>

    int sizes[100];
    #define SIZEOF(x) sizes[TYPEOF(x)]

    // test function - no checks!
    SEXP test(SEXP vec, SEXP SD, SEXP lengths)
    {
        R_len_t i;
        char before_address[32], after_address[32];
        SEXP tmp, ans;
        PROTECT(tmp = allocVector(INTSXP, 1));
        PROTECT(ans = allocVector(STRSXP, 2));
        // sizes[] is zero-initialised in this stand-alone file, so fill in
        // the one entry used below; otherwise memcpy would copy 0 bytes
        sizes[INTSXP] = sizeof(int);
        // address of SD before any SETLENGTH call
        snprintf(before_address, 32, "%p", (void *)SD);
        for (i=0; i<LENGTH(lengths); i++) {
            // copy the first lengths[i] integers of vec into SD, then shrink SD
            memcpy((char *)DATAPTR(SD), (char *)DATAPTR(vec), INTEGER(lengths)[i] * SIZEOF(tmp));
            SETLENGTH(SD, INTEGER(lengths)[i]);
            // do some computation here.. ex: mean(SD)
        }
        // address of SD after all SETLENGTH calls
        snprintf(after_address, 32, "%p", (void *)SD);
        SET_STRING_ELT(ans, 0, mkChar(before_address));
        SET_STRING_ELT(ans, 1, mkChar(after_address));
        UNPROTECT(2);
        return(ans);
    }
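The buffer-reuse idea described before the C code can also be imitated in plain R. This is only an analogy under stated assumptions: base R has no way to shorten a vector in place, so taking the first `n` elements stands in for `SETLENGTH`, and the values of `x` below are made up rather than produced by `set.seed(45)`.

```r
x  <- c(3, 1, 2, 2, 1, 3, 1, 2, 1, 1, 2, 3)   # made-up values
id <- rep(3:1, c(2, 4, 6))                     # same grouping as the DT example

groups <- split(seq_along(x), id)              # row indices per group
sizes  <- vapply(groups, length, integer(1))   # group sizes; the largest is 6 (id = 1)
print(sizes)

buf   <- numeric(max(sizes))                   # one buffer, sized for the largest group
means <- numeric(length(groups))
for (g in seq_along(groups)) {
  n <- sizes[[g]]
  buf[seq_len(n)] <- x[groups[[g]]]            # overwrite the front of the buffer
  means[g] <- mean(buf[seq_len(n)])            # use only the first n elements
}

# Same result as an ordinary grouped mean
stopifnot(isTRUE(all.equal(means, as.vector(tapply(x, id, mean)))))
```

Unlike the C version, R may copy `buf` behind the scenes here, so this shows the bookkeeping only, not the memory behaviour.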
Here `vec` is equivalent to any data.table `dt`, `SD` is equivalent to `.SD`, and `lengths` holds the length of each group. This is just a dummy program. Basically, for each value of `lengths`, say `n`, the first `n` elements are copied from `vec` on to `SD`. Then one can compute whatever one wants on this `SD` (which is not done here). For our purposes, the address of `SD` before and after the operation using `SETLENGTH` is returned, to illustrate that there's no copy being made by `SETLENGTH`.

Save this file as `test.c` and then compile it as follows from the terminal:

    R CMD SHLIB -o test.so test.c
Now, open a new R session, go to the path where `test.so` exists and then type:

    dyn.load("test.so")
    require(data.table)
    set.seed(45)
    max_len <- as.integer(1e6)
    lengths <- as.integer(sample(4:(max_len)/10, max_len/10))
    gc()
    vec <- 1:max_len
    for (i in 1:100) {
        SD <- vec[1:max(lengths)]
        bla <- .Call("test", vec, SD, lengths)
        print(gc())
    }
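One detail of the driver is worth knowing when reading its output: in R, `:` binds more tightly than `/`, so `4:(max_len)/10` means `(4:max_len)/10`, not `4:(max_len/10)`. The sketch below shows this on a small stand-in value (20 instead of `max_len`); which of the two was intended is not stated in the post.

```r
small <- 20                                # stand-in for max_len

identical(4:(small)/10, (4:small)/10)      # TRUE
range(as.integer(4:(small)/10))            # 0 2
4:(small/10)                               # 4 3 2
```

So the group lengths sampled in the driver are truncated fractions, and a length of 0 can occur.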
Note that for each `i` here, `.SD` will be allocated a different memory location, and that's being replicated here by assigning `SD` for each `i`.

By running this code, you'll find that 1) the two values returned are identical for each `i` to that of `address(SD)` and 2) `Vcells used Mb` keeps increasing. Now, remove all variables from the workspace with `rm(list=ls())` and then do `gc()`; you'll find that not all memory is being restored/freed.

Initial:

              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  332708 17.8     597831 32.0   467875 25.0
    Vcells 1033531  7.9    2327578 17.8  2313676 17.7

After 100 runs:

              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  332912 17.8     597831 32.0   467875 25.0
    Vcells 2631370 20.1    4202816 32.1  2765872 21.2

After `rm(list=ls())` and `gc()`:

              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  341275 18.3     597831 32.0   467875 25.0
    Vcells 2061531 15.8    4202816 32.1  3121469 23.9
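Reading the three tables together, the Vcells `(Mb)` figures can be compared directly. The numbers below are copied from the output above; the variable names are illustrative.

```r
# Vcells "used" in Mb, taken from the three gc() tables
vcells_mb <- c(initial = 7.9, after_100_runs = 20.1, after_rm_gc = 15.8)

growth   <- vcells_mb[["after_100_runs"]] - vcells_mb[["initial"]]
retained <- vcells_mb[["after_rm_gc"]]    - vcells_mb[["initial"]]

round(c(growth = growth, retained = retained), 1)   # growth 12.2, retained 7.9
```

So roughly 12 Mb was added over the 100 runs, and about 8 Mb of it is still reported after clearing the workspace.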
If you remove the line `SETLENGTH(SD, ...)` from the C code and run it again, you'll find that there's no change in the Vcells.

Now as to why `SETLENGTH` on grouping with non-identical group lengths has this effect, I'm still trying to understand; check out the link in the edit above.