Understanding data.table invalid .selfref warning


Question

I am trying to figure out the data.table 'invalid .selfref' error that I am getting with the code below.

library(data.table) 
library(dplyr)
DT <- data.table(aa=1:100, bb=rnorm(n=100), dd=gl(2,100))
DT <- DT %.% group_by(dd, aa) %.% summarize(m=mean(bb))
DT <- DT[, ee := 3]

The last line throws the error. Elsewhere it has been suggested to simply write the last line as DT$ee <- 3, but that suggestion doesn't explain why it works (and why := doesn't), and as a beginner data.table user it also doesn't feel like the proper data.table idiom.

It IS related to the dplyr line, which obviously changes the DT data table. But when I change that line (and those following) into DDT <- DT %.% group_by() ... I still get the selfref error from the DT[, ee := 3] line.

Session info:

R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base     

other attached packages:
[1] dplyr_0.2        data.table_1.9.2 ggplot2_1.0.0   

loaded via a namespace (and not attached):
 [1] assertthat_0.1   colorspace_1.2-4 digest_0.6.4     grid_3.1.0      
 [5] gtable_0.1.2     MASS_7.3-31      munsell_0.4.2    parallel_3.1.0  
 [9] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4    
[13] scales_0.2.4     stringr_0.6.2    tools_3.1.0     


Recommended answer

I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can later be added efficiently by reference), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.

Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes at the top correspond to the vector of column pointers, and the arrows show the data each pointer points to.

The figure on the left shows how adding (or cbind-ing) a column to a data.frame works. cbind-ing a column results in a (deep or shallow) copy, which gives a new location to the vector of column pointers (shown in yellow) and to the data (which now has one more column).

The figure on the right shows the data.table way: there are more than 3 blue boxes to begin with, because data.table over-allocates when the table is created. With :=, not even a shallow copy is made. The existing column pointers stay where they are, and the next unused over-allocated box is used for the new column.
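The over-allocation described above can be inspected directly. The following is a minimal sketch, assuming a current data.table where length(), truelength() and address() behave as documented; the toy columns are made up for illustration and the exact number of spare slots depends on the data.table version and options.

```r
library(data.table)

# Small illustrative table (values are arbitrary).
DT <- data.table(aa = 1:3, bb = c(0.1, 0.2, 0.3), dd = c("x", "y", "z"))

used_before  <- length(DT)      # column pointer slots in use (the columns)
alloc_before <- truelength(DT)  # slots allocated, including the spare ones
addr_before  <- address(DT)     # address of the column pointer vector

DT[, ee := 3]                   # add a column by reference

used_after <- length(DT)
addr_after <- address(DT)

print(c(used_before = used_before, used_after = used_after,
        allocated = alloc_before))
# TRUE here means the pointer vector was not moved, i.e. no copy was taken.
print(identical(addr_before, addr_after))
```

If the address is unchanged, the new column simply took the next spare slot, which is the behaviour the right-hand figure depicts.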

So much for the difference. Now, the warning tells you that whatever operation you did has removed this over-allocation, meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which is unnecessary and should be avoided, but since the over-allocation is already gone, we do the next best thing).
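As a hedged illustration of over-allocating again, the sketch below uses a round trip through saveRDS()/readRDS() as a stand-in for "some operation that does not preserve the spare slots". This is an assumption chosen for the example and not what dplyr does internally. setDT() is then used to restore the over-allocation before := is called.

```r
library(data.table)

DT <- data.table(aa = 1:3, bb = c(0.1, 0.2, 0.3))

# Assumption for illustration: write the table to disk and read it back.
# The object that comes back is not guaranteed to keep its spare slots.
tmp <- tempfile(fileext = ".rds")
saveRDS(DT, tmp)
DT2 <- readRDS(tmp)

setDT(DT2)                               # re-over-allocate by reference
spare <- truelength(DT2) - length(DT2)   # spare slots available again

DT2[, ee := 3]                           # add a column by reference
print(spare > 0)
print(names(DT2))
```

Calling setDT() explicitly makes the re-allocation a deliberate step instead of something := has to do on its own when it detects the problem.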

My guess is that your dplyr syntax somehow removes this over-allocation. This is caught in the next step, when you use :=, and data.table over-allocates once again before adding the new column by reference (which results in a shallow copy).

If I do this the data.table way:

DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
DT[, ee := 3]

it works fine.
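To check this for yourself, the sketch below runs the same data.table-only pipeline while collecting any warnings, then inspects the over-allocation of the result. The set.seed() call and the warns collector are additions for illustration only.

```r
library(data.table)

set.seed(1)  # illustrative only, makes bb reproducible
DT <- data.table(aa = 1:100, bb = rnorm(n = 100), dd = gl(2, 100))

warns <- character(0)  # hypothetical collector for warning messages
withCallingHandlers({
  DT <- DT[, list(m = mean(bb)), by = list(dd, aa)]  # aggregate in data.table
  DT[, ee := 3]                                      # add column by reference
}, warning = function(w) {
  warns <<- c(warns, conditionMessage(w))
  invokeRestart("muffleWarning")
})

print(length(warns))                 # number of warnings raised
print(truelength(DT) > length(DT))   # are spare slots still present?
```

The aggregated result is itself a freshly over-allocated data.table, so := has spare slots to use and nothing needs to be repaired.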

I don't have the time to look into dplyr right now to verify this or to find out what is causing it.
