Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function


Problem description

This looks like an fread bug, but I am not sure.

This example reproduces my problem. I have a function in which I read a data.table and return it in a list. I use a list so that I can group other results in the same structure. Here is my code:

ff.fread <- function(){
  dt = fread("x
1
2
")
  list(dt=dt)   
}

DT.f <- ff.fread()$dt

Now when I try to add a new column to DT.f, it works but I get a warning message:

DT.f[,y:=1:2]
Warning message:
In `[.data.table`(DT.f, , `:=`(y, 1:2)) :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole
  table so that := can add this new column by reference. At an earlier point,
  this data.table has been copied by R (or been created manually using
  structure() or similar). Avoid key<-, names<- and attr<- which in R currently
  (and oddly) may copy the whole data.table. Use set* syntax instead to avoid
  copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0, list(DT1,DT2) copied
  the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade
  to R>=v3.1.0 if that is biting. If this message doesn't help, please report to
  datatable-help so the root cause can be fixed.

Note that if I create the data.table manually, I don't get this warning. This works fine, for example:

ff <- function(){
  list(dt=data.table(x=1:2))
}
DT <- ff()$dt
DT[,y:=1:2]

Or, if I don't return the result of fread within a list, it also works fine:

ff.fread <- function(){
  dt = fread("x
1
2
")
  dt
}

Solution

This has nothing to do with fread per se; the cause is that you're calling list() and passing it a named object. We can recreate this by doing:

require(data.table)
DT <- data.table(x=1:2)       # name the object 'DT'
DT.l <- list(DT=DT)           # create a list containing one data.table
y <- DT.l$DT                  # get back the data.table
y[, bla := 1L]                # now add by reference
# works fine but warning message will occur

DT.l = list(DT=data.table(x=1:2))   # DT = a call, not a named object
y = DT.l$DT
y[, bla:=1L]
# works fine and no warning message

Good news:

The good news is that from R version 3.1.0 onwards (in devel at the time of writing), passing a named object to list() no longer creates a copy; instead, its reference count (the number of objects pointing to this value) just gets bumped. So the problem goes away once you are on R >= 3.1.0.
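On a current R installation this can be checked directly. The following is a minimal sketch, assuming R >= 3.1.0 and data.table's exported address() helper; the object names are illustrative only.

```r
library(data.table)

DT   <- data.table(x = 1:2)
DT.l <- list(DT = DT)          # on R >= 3.1.0 this should not copy DT

# address() returns the memory address of an object as a string.
# If list() did not copy, both addresses are the same.
same_object <- identical(address(DT), address(DT.l$DT))
print(same_object)
```

If same_object is TRUE, the list element and DT are one and the same table, so an update by reference through one name should be visible through the other; copy() is the way to get an independent table.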

To understand how data.table detects copies using .internal.selfref, we'll dive into some history of data.table.

First, some history:

You should know that data.table over-allocates column pointer slots on creation (truelength is set to a default of 100) so that := can be used to add columns by reference later on. There was one issue with this: handling copies. For example, when we call list() and pass it a named object, a copy is made, as illustrated below.

tracemem(DT)
# [1] "<0x7fe23ac3e6d0>"
DT.list <- list(DT=DT)    # `DT` is the named object on the RHS of = here
# tracemem[0x7fe23ac3e6d0 -> 0x7fe23cd72f48]: 

The problem with any copy of a data.table that R makes (as opposed to data.table's copy()) is that R internally sets the truelength to 0, even though the truelength(.) function will still return the correct result. This inadvertently led to a segfault when updating by reference with :=, because the over-allocation didn't exist any more (or at least was no longer recognised). This happened in versions < 1.7.8. To overcome this, an attribute called .internal.selfref was introduced. You can check this attribute by doing attributes(DT).
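As a quick illustration of the two pieces of state mentioned above, the sketch below inspects the over-allocation and the self-reference attribute. It assumes a current data.table, where the default over-allocation may differ from the 100 quoted here, so no specific slot counts are shown.

```r
library(data.table)

DT <- data.table(x = 1:2)

length(DT)       # number of columns actually in use
truelength(DT)   # number of column pointer slots allocated

# The self-reference is stored as an ordinary attribute holding an external pointer
selfref <- attr(DT, ".internal.selfref")
typeof(selfref)  # "externalptr"
```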

From NEWS (of v1.7.8):

o The 'Chris crash' is fixed. The root cause was that key<- always copies the whole table. The problem with that copy (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. key<- was used internally, in particular in merge(). So, adding a column using := after merge() was a memory overwrite, since the over allocated memory wasn't really there after key<-'s copy.

data.tables now have a new attribute .internal.selfref to catch and warn about such copies in future. All internal use of key<- has been replaced with setkey(), or new function setkeyv() which accepts a vector, and do not copy.

What does this .internal.selfref do?

It just points to itself, basically. It's simply an attribute attached to DT that contains the address of DT in RAM. If R inadvertently copies DT, the address of DT will move in RAM but the attached attribute will still contain the old memory address, so the two won't match any more. data.table checks that they do match (i.e. that the selfref is valid) before adding a new column by reference into a spare column pointer slot.

How is .internal.selfref implemented?

In order to understand the attribute .internal.selfref, we have to understand what an external pointer (EXTPTRSXP) is. This page explains it nicely. The essential lines are:

External pointer SEXPs are intended to handle references to C structures such as handles, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated.

They are created as:

SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);

where p is the pointer (and hence this cannot portably be a function pointer), and tag and prot are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag field for some form of type identification and the prot field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap.

In our case, we create the attribute .internal.selfref for DT. Its value is an external pointer to NULL (whose address you see in the attribute value), and this external pointer's prot field is another external pointer back to DT (hence "selfref"), with its own prot set to NULL this time.

Note: We have to employ this "extptr to NULL whose prot is an extptr" strategy so that identical(DT1, DT2), where DT1 and DT2 are two different copies with the same content, returns TRUE. (If you don't understand what this means, you can skip to the next part; it isn't relevant to understanding the answer to this question.)

Okay, how does this all work then?

We know that an external pointer does not get duplicated during a copy. Basically, when we create a data.table, the attribute .internal.selfref is set to an external pointer to NULL whose prot field is an external pointer back to DT. Now, when an unintentional "copy" is made, the object's address changes but the address protected by the attribute does not. It still points to the original DT, whether that still exists or not, because it won't (and can't) be modified. The copy is therefore detected internally by comparing the address of the current object with the address protected by the external pointer. If they don't match, then a "copy" has been made by R (which would have lost the over-allocation that data.table carefully created). That is:

DT <- data.table(x=1:2) # internal selfref set
DT.list <- list(DT=DT)  # copy made, address(DT.list$DT) != address(DT)
                        # and truelength would be affected.

DT.new <- DT.list$DT    # address of DT.new != address of DT
                        # and it's not equal to the address pointed to by
                        # the attribute's 'prot' external pointer

# so a re-over-allocation has to be made by data.table at the next update by
# reference, and it warns so you can fix the root cause by not using list(),
# key<-, names<- etc.
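The set* functions recommended in the warning can be sketched as follows. This is a minimal illustration, assuming the exported data.table functions setnames(), setattr(), address() and copy(); the column and attribute names are made up for the example.

```r
library(data.table)

DT <- data.table(x = 1:2)
before <- address(DT)

setnames(DT, "x", "z")            # rename by reference, instead of names(DT) <- "z"
setattr(DT, "note", "example")    # set an attribute by reference, instead of attr(DT, "note") <- ...
DT[, y := z * 2L]                 # add a column by reference

after <- address(DT)
identical(before, after)          # the table was never copied

# When an independent table is really wanted, copy() makes one whose
# self-reference is valid, so := on it does not warn.
DT2 <- copy(DT)
DT2[, w := 0L]
names(DT)                         # DT is unchanged by the update to DT2
```

Because every step works by reference, the address of DT stays the same throughout, which is exactly the condition the self-reference check relies on.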

That's a lot to take in. I think I've managed to put it across as clearly as possible. If there are any mistakes (it took me a while to wrap my head around this) or ways to make it clearer, feel free to edit or comment with your suggestions.

Hope this clears things up.
