Memory leak in data.table grouped assignment by reference


Problem description

I'm seeing odd memory usage when using assignment by reference by group in a data.table. Here's a simple example to demonstrate (please excuse the triviality of the example):

N <- 1e6
dt <- data.table(id=round(rnorm(N)), value=rnorm(N))

gc()
for (i in seq(100)) {
  dt[, value := value+1, by="id"]
}
gc()
tables()

which produces the following output:

> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells  303909 16.3     597831 32.0   407500 21.8
Vcells 2442853 18.7    3260814 24.9  2689450 20.6
> for (i in seq(100)) {
  +   dt[, value := value+1, by="id"]
  + }
> gc()
used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   315907  16.9     597831  32.0   407500  21.8
Vcells 59966825 457.6   73320781 559.4 69633650 531.3
> tables()
NAME      NROW MB COLS     KEY
[1,] dt   1,000,000 16 id,value    
Total: 16MB

So about 440MB of used Vcells memory was added after the loop. This memory is not accounted for after removing the data.table from memory:

> rm(dt)
> gc()
used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells   320888  17.2     597831   32   407500  21.8
Vcells 57977069 442.4   77066820  588 69633650 531.3
> tables()
No objects of class data.table exist in .GlobalEnv
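
To see this growth per iteration rather than only in aggregate, one can record the used Vcells after every grouped := call. A minimal sketch along those lines, rebuilding the same dt as above and assuming data.table 1.8.10 (the vcells vector is just a scratch variable):

library(data.table)
N  <- 1e6
dt <- data.table(id = round(rnorm(N)), value = rnorm(N))
dt[, .N, by = id]                      # group sizes are far from equal
vcells <- numeric(20)
for (i in seq_along(vcells)) {
  dt[, value := value + 1, by = "id"]
  vcells[i] <- gc()["Vcells", "used"]  # the "used" entry of gc()'s Vcells row
}
diff(vcells)                           # on 1.8.10 these stay positive, roughly the per-call growth behind the ~440MB above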

The memory leak seems to disappear when removing the by=... from the assignment:

>     gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells  312955 16.8     597831 32.0   467875 25.0
Vcells 2458890 18.8    3279586 25.1  2704448 20.7
>     for (i in seq(100)) {
  +       dt[, value := value+1]
  +     }
>     gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells  322698 17.3     597831 32.0   467875 25.0
Vcells 2478772 19.0    5826337 44.5  5139567 39.3
>     tables()
NAME      NROW MB COLS     KEY
[1,] dt   1,000,000 16 id,value    
Total: 16MB

To summarize, two questions:

  1. Am I missing something or is there a memory leak?
  2. If there is indeed a memory leak, can anyone suggest a workaround that lets me use assignment by reference by group without the memory leak?

For reference, here's the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] data.table_1.8.10

loaded via a namespace (and not attached):
  [1] tools_3.0.2

Solution

UPDATE from Matt - Now fixed in v1.8.11. From NEWS:

Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered, #2648. Test added.

Many thanks to vc273, Y T and others.
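
One way to check whether an installed data.table already carries this fix is to compare the package version against the 1.8.11 mentioned in the NEWS item above, for example:

if (packageVersion("data.table") >= "1.8.11") {
  message("includes the grouping memory-leak fix (#2648)")
} else {
  message("affected: the fix landed in data.table 1.8.11")
}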



From Arun ...

Why was this happening?

I wish I had come across this post before sitting on this issue. Nevertheless, a nice learning experience. Simon Urbanek summarises the issue pretty succinctly: it's not a memory leak but bad reporting of memory used/freed. I had the feeling this is what was happening.


What's the reason for this to happen in data.table? This part identifies the portion of code in dogroups.c responsible for the apparent memory increase.

Okay, so after some tedious testing, I think I've managed to at least find the reason for this behaviour. Hopefully someone can help me get there from this post. My conclusion is that this is not a memory leak.

The short explanation is that this seems to be an effect of how the SETLENGTH function (from R's C interface) is used in data.table's dogroups.c.

In data.table, when you use by=..., for example,

set.seed(45)
DT <- data.table(x=sample(3, 12, TRUE), id=rep(3:1, c(2,4,6)))
DT[, list(y=mean(x)), by=id]

Corresponding to id=1, the values of "x" (=c(1,2,1,1,2,3)) have to be picked. This means having to allocate memory for .SD (all columns not in by) for each by value.

To overcome this allocation for each group in by, data.table accomplishes it cleverly by first allocating .SD with the length of the largest group in by (which here corresponds to id=1, length 6). Then, for each value of id, the (over-)allocated data.table is reused, and using the function SETLENGTH the length is simply adjusted to the length of the current group. Note that by doing this, no memory is actually being allocated, apart from the single allocation made for the biggest group.
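
To make the sizes concrete for the toy DT above, the per-group counts can be listed directly; the largest of them is the length .SD gets allocated with once:

DT[, .N, by = id]    # counts 2, 4 and 6; the largest (6, for id = 1) is .SD's allocated length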

But what seems strange is that when all the groups in by have the same number of items, nothing special shows up in the gc() output. However, when the group sizes are not the same, gc() seems to report increasing usage in Vcells, even though no extra memory is being allocated in either case.
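
This claim can be checked roughly at the R level by looping a grouped := over two tables that differ only in whether their group sizes are equal, and comparing the growth in used Vcells. A sketch, assuming data.table 1.8.10 (the vcell_growth helper is just for illustration):

library(data.table)
N <- 1e6
dt_equal   <- data.table(id = rep(1:100, each = N/100), value = rnorm(N))   # 100 groups, all of size N/100
dt_unequal <- data.table(id = round(rnorm(N)),          value = rnorm(N))   # group sizes differ a lot

vcell_growth <- function(d, reps = 50) {
  gc()
  start <- gc()["Vcells", "used"]
  for (i in seq(reps)) d[, value := value + 1, by = "id"]
  gc()["Vcells", "used"] - start        # growth in used Vcells over the loop
}
vcell_growth(dt_equal)      # expected: little or no growth
vcell_growth(dt_unequal)    # expected: keeps growing, as described above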

To illustrate this point, I've written C code that mimics the SETLENGTH usage in data.table's dogroups.c.

// test.c
#define USE_RINTERNALS        // needed for DATAPTR / SETLENGTH
#include <R.h>
#include <Rinternals.h>
#include <Rdefines.h>
#include <stdio.h>            // snprintf
#include <string.h>           // memcpy

// mimic data.table's SIZEOF: bytes per element, indexed by SEXP type
int sizes[100];
#define SIZEOF(x) sizes[TYPEOF(x)]

// test function - no checks!
SEXP test(SEXP vec, SEXP SD, SEXP lengths)
{
    R_len_t i;
    char before_address[32], after_address[32];
    SEXP tmp, ans;
    sizes[INTSXP] = sizeof(int);   // without this, SIZEOF(tmp) is 0 and memcpy copies nothing
    PROTECT(tmp = allocVector(INTSXP, 1));
    PROTECT(ans = allocVector(STRSXP, 2));
    snprintf(before_address, 32, "%p", (void *)SD);
    for (i = 0; i < LENGTH(lengths); i++) {
        // copy the first lengths[i] elements of vec into SD, then shrink SD to that length
        memcpy((char *)DATAPTR(SD), (char *)DATAPTR(vec), INTEGER(lengths)[i] * SIZEOF(tmp));
        SETLENGTH(SD, INTEGER(lengths)[i]);
        // do some computation here.. ex: mean(SD)
    }
    snprintf(after_address, 32, "%p", (void *)SD);
    SET_STRING_ELT(ans, 0, mkChar(before_address));
    SET_STRING_ELT(ans, 1, mkChar(after_address));
    UNPROTECT(2);
    return(ans);
}

Here vec is equivalent to any data.table dt, SD is equivalent to .SD, and lengths holds the length of each group. This is just a dummy program. Basically, for each value in lengths, say n, the first n elements are copied from vec onto SD. Then one can compute whatever one wants on this SD (which is not done here). For our purposes, the address of SD before and after the SETLENGTH operations is returned, to illustrate that no copy is being made by SETLENGTH.

Save this file as test.c and then compile it as follows from terminal:

R CMD SHLIB -o test.so test.c

Now, open a new R-session, go to the path where test.so exists and then type:

dyn.load("test.so")
require(data.table)
set.seed(45)
max_len <- as.integer(1e6)
lengths <- as.integer(sample(4:(max_len)/10, max_len/10))   # non-identical "group" lengths
gc()
vec <- 1:max_len
for (i in 1:100) {
    SD <- vec[1:max(lengths)]              # fresh allocation each iteration, as for .SD
    bla <- .Call("test", vec, SD, lengths)
    print(gc())
}

Note that, just as .SD would be allocated at a different memory location for each grouped call, SD is assigned afresh here for each i.

By running this code, you'll find that 1) the two addresses returned are identical for every i (that is, the address of SD is unchanged) and 2) the used Vcells Mb keeps increasing. Now remove all variables from the workspace with rm(list=ls()) and then call gc(); you'll find that not all of the memory is being restored/freed.

Initial:

          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  332708 17.8     597831 32.0   467875 25.0
Vcells 1033531  7.9    2327578 17.8  2313676 17.7

After 100 runs:

          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  332912 17.8     597831 32.0   467875 25.0
Vcells 2631370 20.1    4202816 32.1  2765872 21.2

After rm(list=ls()) and gc():

          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  341275 18.3     597831 32.0   467875 25.0
Vcells 2061531 15.8    4202816 32.1  3121469 23.9
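
The no-copy claim can also be checked directly from the two-element character vector returned by .Call in the loop above:

bla                         # the addresses of SD recorded before and after the SETLENGTH loop
identical(bla[1], bla[2])   # TRUE: SETLENGTH neither moved nor copied SD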

If you remove the SETLENGTH(SD, ...) line from the C code and run it again, you'll find that there's no change in the Vcells.

Now as to why SETLENGTH on grouping with non-identical group lengths has this effect, I'm still trying to understand - check out the link in the edit above.
