Understanding exactly when a data.table is a reference to (vs a copy of) another data.table


Problem description

I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to understand exactly what's happening.

On creating a data.table from another data.table (via <-), then updating the new table by :=, the original table is also altered. This is expected, as per:

?data.table::copy and stackoverflow: pass-by-reference-the-operator-in-the-data-table-package

Here's an example:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

newDT <- DT        # reference, not copy
newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.
#        a  b
# [1,] 100 11
# [2,]   2 12

However, if I insert a non-:= based modification between the <- assignment and the := lines above, DT is now no longer modified:

DT = data.table(a=c(1,2), b=c(11,12))
newDT <- DT        
newDT$b[2] <- 200  # new operation
newDT[1, a := 100]

print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

So it seems that the newDT$b[2] <- 200 line somehow 'breaks' the reference. I'd guess that this invokes a copy somehow, but I would like to understand fully how R is treating these operations, to ensure I don't introduce potential bugs in my code.

I'd very much appreciate if someone could explain this to me.

Solution

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))   # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]: 
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<- 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB:  # ..snip..

Notice that even the a vector was copied (its different hex value indicates a new copy of the vector), even though a wasn't changed. The whole of b was copied too, rather than only the element that needed changing. Copies like these are important to avoid for large data, which is why := and set() were introduced to data.table.

Now that newDT is a separate copy, we can modify it by reference:

newDT
#      a   b
# [1,] 1  11
# [2,] 2 200

newDT[2, b := 400]
#      a   b        # See FAQ 2.21 for why this prints newDT
# [1,] 1  11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB:  # ..snip..

Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
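The same check can be made without reading inspect() output. The following is only a minimal sketch, assuming data.table's exported address() helper; the table name chk is made up for illustration so that it does not clash with DT and newDT above.

```r
library(data.table)

# 'chk' is a hypothetical table used only for this check
chk <- data.table(a = c(1, 2), b = c(11, 12))

# Record the address of the table and of each column before the update
before <- c(table = address(chk), a = address(chk$a), b = address(chk$b))

# 400 is a double and so is column b, so the value is written into the existing vector
chk[2, b := 400]

after <- c(table = address(chk), a = address(chk$a), b = address(chk$b))

unchanged <- identical(before, after)
unchanged
```

If unchanged is TRUE, neither the table nor its columns were reallocated by the := call.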

Or, we can modify the original DT by reference:

DT[2, b := 600]
#      a   b
# [1,] 1  11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB:  # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.
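As a compact recap of the whole walk-through, here is a hedged sketch that contrasts plain assignment with an explicit copy(). It assumes data.table's exported address() and set() functions; the names orig, alias and dup are illustrative and not part of the example above.

```r
library(data.table)

orig  <- data.table(a = c(1, 2), b = c(11, 12))
alias <- orig         # plain assignment: a second name for the same object
dup   <- copy(orig)   # explicit deep copy: a separate object

# Compare memory addresses of the three tables
same_as_alias <- identical(address(orig), address(alias))
same_as_dup   <- identical(address(orig), address(dup))

# set() writes row 2 of column "b" in place, going through the alias
set(alias, i = 2L, j = "b", value = 200)

orig$b   # the original sees the change made through the alias
dup$b    # the deep copy keeps the old values
```

Reaching for copy() up front is the explicit way to get a table that later := or set() calls on the other name cannot touch.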

Btw, if you call tracemem(DT) and then DT[2, b := 600], you'll see one copy reported. That is a copy of the first 10 rows made by the print method. When the call is wrapped with invisible(), or made within a function or script, the print method isn't called.

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
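The function behaviour described above can be sketched as follows. Both helper functions are hypothetical names invented for this illustration; only :=, copy() and data.table() come from the package.

```r
library(data.table)

# Hypothetical helper: adds a column by reference, so the caller's table changes
add_flag_inplace <- function(x) {
  x[, flag := TRUE]
  invisible(x)
}

# Hypothetical helper: works on a local deep copy, so the caller's table is untouched
add_flag_copy <- function(x) {
  x <- copy(x)
  x[, flag := TRUE]
  x
}

DT <- data.table(a = c(1, 2), b = c(11, 12))

res <- add_flag_copy(DT)
names_after_copy <- names(DT)       # the caller's table has no new column
names_of_result  <- names(res)      # the returned copy has it

add_flag_inplace(DT)
names_after_inplace <- names(DT)    # now the caller's table has it too
```

The copying variant costs a full duplicate of the table on every call, which is the cost the answer suggests avoiding for large data unless a local copy is really needed.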
