Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function


Question


This looks like an fread bug, but I am not sure.

This example reproduces my problem. I have a function that reads a data.table and returns it in a list. I use the list to group other results in the same structure. Here is my code:

ff.fread <- function(){
  dt = fread("x
1
2
")
  list(dt=dt)   
}

DT.f <- ff.fread()$dt

Now when I try to add a new column to DT.f, it works but I get a warning message:

DT.f[,y:=1:2]
Warning message:
In `[.data.table`(DT.f, , `:=`(y, 1:2)) :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole
  table so that := can add this new column by reference. At an earlier point,
  this data.table has been copied by R (or been created manually using
  structure() or similar). Avoid key<-, names<- and attr<- which in R currently
  (and oddly) may copy the whole data.table. Use set* syntax instead to avoid
  copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0, list(DT1,DT2) copied
  the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade
  to R>=v3.1.0 if that is biting. If this message doesn't help, please report to
  datatable-help so the root cause can be fixed.

Note that if I create the data.table manually, I don't get this warning. For example, this works fine:

ff <- function(){
  list(dt=data.table(x=1:2))
}
DT <- ff()$dt
DT[,y:=1:2]

Or, if I don't return the result of fread within a list, it also works fine:

ff.fread <- function(){
  dt = fread("x
1
2
")
  dt
}

Solution

This has nothing to do with fread per se; the cause is that you're calling list() and passing it a named object. We can reproduce it like this:

require(data.table)
DT <- data.table(x=1:2)       # name the object 'DT'
DT.l <- list(DT=DT)           # create a list containing one data.table
y <- DT.l$DT                  # get back the data.table
y[, bla := 1L]                # now add by reference
# works fine but warning message will occur

DT.l = list(DT=data.table(x=1:2))   # DT = a call, not a named object
y = DT.l$DT
y[, bla:=1L]
# works fine and no warning message
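
One way to sidestep the warning, offered here as a sketch rather than as the fix, is to take a deliberate copy with data.table's copy() when pulling the table out of the list. The assumption is that copy() returns a table with its own over-allocation and a valid self-reference, so := has nothing to repair:

```r
library(data.table)

DT   <- data.table(x = 1:2)   # named object
DT.l <- list(DT = DT)         # list containing the data.table

y <- copy(DT.l$DT)            # deliberate deep copy made by data.table, not by R
y[, bla := 1L]                # adds the column to y by reference

y_names  <- names(y)          # columns of the copy
DT_names <- names(DT)         # the original is left untouched
```

The price is an extra copy of the data, so this suits small tables or cases where you want an independent object anyway.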

Good news:

The good news is that from R 3.1.0 onwards (in devel at the time of writing), passing a named object to list() no longer creates a copy; instead, its reference count (the number of objects pointing to this value) just gets bumped. So the problem goes away with that version of R.

To understand how data.table detects copies using .internal.selfref, we'll dive into some history of data.table.

First, some history:

You should know that data.table over-allocates column pointer slots on creation (truelength is set to a default of 100) so that := can be used to add columns by reference later on. This raised one issue: handling copies. For example, when we call list() and pass it a named object, a copy is made, as illustrated below.

tracemem(DT)
# [1] "<0x7fe23ac3e6d0>"
DT.list <- list(DT=DT)    # `DT` is the named object on the RHS of = here
# tracemem[0x7fe23ac3e6d0 -> 0x7fe23cd72f48]: 

The problem with any copy of a data.table that R makes (as opposed to data.table's copy()) is that R internally sets the truelength parameter to 0, even though the truelength(.) function will still return the correct result. This inadvertently led to a segfault when the copy was updated by reference with :=, because the over-allocation no longer existed (or at least was no longer recognised). This happened in versions < 1.7.8. To overcome this, an attribute called .internal.selfref was introduced. You can inspect this attribute with attributes(DT).
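
To see the over-allocation and the attribute for yourself, here is a small sketch using data.table's exported truelength(). The exact number of slots depends on your data.table version and options, so no particular value is assumed:

```r
library(data.table)

DT <- data.table(x = 1:2)

used  <- length(DT)        # columns currently in use
slots <- truelength(DT)    # column pointer slots allocated for the table
spare <- slots - used      # room left for := to add columns by reference

has_selfref <- ".internal.selfref" %in% names(attributes(DT))

DT[, y := 3:4]             # the new column goes into one of the spare slots
n_cols <- length(DT)
```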

From NEWS (of v1.7.8):

o The 'Chris crash' is fixed. The root cause was that key<- always copies the whole table. The problem with that copy (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. key<- was used internally, in particular in merge(). So, adding a column using := after merge() was a memory overwrite, since the over allocated memory wasn't really there after key<-'s copy.

data.tables now have a new attribute .internal.selfref to catch and warn about such copies in future. All internal use of key<- has been replaced with setkey(), or new function setkeyv() which accepts a vector, and do not copy.
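
The set* functions mentioned in the warning and in the NEWS entry modify the table in place. A short sketch with made-up column and attribute names, using address() to check that the object has not moved:

```r
library(data.table)

DT <- data.table(x = 2:1, y = c("b", "a"))
before <- address(DT)

setkey(DT, x)                 # instead of key(DT) <- "x"
setnames(DT, "y", "label")    # instead of names(DT)[2] <- "label"
setattr(DT, "note", "demo")   # instead of attr(DT, "note") <- "demo"

unmoved <- address(DT) == before   # still the same object in memory
```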

What does this .internal.selfref do?

Basically, it just points to itself. It's simply an attribute attached to DT that contains the address of DT in RAM. If R inadvertently copies DT, the address of DT will move in RAM but the attached attribute will still contain the old memory address, so the two won't match any more. data.table checks that they do match (i.e. that the selfref is valid) before adding a new column by reference into a spare column pointer slot.

How is .internal.selfref implemented?
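
As a rough illustration of the address comparison (not the actual C-level check), data.table's exported address() shows that a plain assignment keeps the same object while a copy lives elsewhere in RAM. Here copy() merely stands in for a copy made by R; unlike R's copy, it sets up a fresh, valid .internal.selfref:

```r
library(data.table)

DT <- data.table(x = 1:2)

same  <- DT         # plain assignment: another name for the same object
other <- copy(DT)   # a distinct object at a different address

same_address  <- address(same)  == address(DT)   # no copy was made
moved_address <- address(other) != address(DT)   # the table now lives elsewhere
```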

In order to understand the .internal.selfref attribute, we have to understand what an external pointer (EXTPTRSXP) is. This page explains it nicely. Copy/pasting the essential lines:

External pointer SEXPs are intended to handle references to C structures such as handles, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated.

They are created as:

SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);

where p is the pointer (and hence this cannot portably be a function pointer), and tag and prot are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag field for some form of type identification and the prot field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap.

In our case, we create the attribute .internal.selfref for DT, whose value is an external pointer to NULL (the address of which you see in the attribute value), and this external pointer's prot field is another external pointer back to DT (hence "selfref"), with its own prot set to NULL this time.

Note: We have to use this strategy (an extptr to NULL whose 'prot' is itself an extptr) so that identical(DT1, DT2) returns TRUE when DT1 and DT2 are two different copies with the same content. (If you don't understand what this means, you can skip to the next part; it's not relevant to understanding the answer to this question.)
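
Both points can be checked from R with a quick sketch: the attribute is an external pointer, and a second table with the same content still compares as identical even though it carries its own self-reference. copy() is used here only as a convenient way to get that second table:

```r
library(data.table)

DT1 <- data.table(x = 1:2)
DT2 <- copy(DT1)                      # same content, separate object

selfref  <- attr(DT1, ".internal.selfref")
ptr_type <- typeof(selfref)           # "externalptr"

same_content <- identical(DT1, DT2)   # compares contents and attributes
```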

Okay, how does this all work then?

We know that the external pointer does not get duplicated during a copy. Basically, when we create a data.table, the attribute .internal.selfref is set to an external pointer to NULL whose prot field holds an external pointer back to DT. Now, when an unintentional "copy" is made, the object's address changes but the address protected by the attribute does not. It still points to the original DT, whether that still exists or not, because it won't/can't be modified. The copy is therefore detected internally by comparing the address of the current object with the address protected by the external pointer. If they don't match, then a "copy" has been made by R (which would have lost the over-allocation that data.table carefully created). That is:

DT <- data.table(x=1:2) # internal selfref set
DT.list <- list(DT=DT)  # copy made, address(DT.list$DT) != address(DT)
                        # and truelength would be affected.

DT.new <- DT.list$DT    # address of DT.new != address of DT
                        # and it's not equal to the address pointed to by
                        # the attribute's 'prot' external pointer

# so a re-over-allocation has to be made by data.table at the next update by
# reference, and it warns so you can fix the root cause by not using list(),
# key<-, names<- etc.

That's a lot to take in. I think I've managed to explain it as clearly as possible. If there are any mistakes (it took me a while to wrap my head around this) or possibilities for further clarity, feel free to edit or comment with your suggestions.

Hope this clears up things.
