Understanding exactly when a data.table is a reference to (vs a copy of) another data.table


Question

I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to understand exactly what's happening.

When I create a data.table from another data.table (via <-) and then update the new table with :=, the original table is also altered. This is expected, as per:

?data.table::copy
stackoverflow: pass-by-reference-the-operator-in-the-data-table-package

Here is an example:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

newDT <- DT        # reference, not copy
newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.
#        a  b
# [1,] 100 11
# [2,]   2 12

However, if I insert a modification that does not use := between the <- assignment and the := line above, DT is no longer modified:

DT = data.table(a=c(1,2), b=c(11,12))
newDT <- DT        
newDT$b[2] <- 200  # new operation
newDT[1, a := 100]

print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

So it seems that the newDT$b[2] <- 200 line somehow 'breaks' the reference. I'd guess that this invokes a copy somehow, but I would like to understand fully how R treats these operations, to ensure I don't introduce potential bugs in my code.

I'd very much appreciate it if someone could explain this to me.

Recommended answer

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)), then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))   # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]: 
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<- 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB:  # ..snip..

Notice how even the a vector was copied (a different hex value indicates a new copy of the vector), even though a wasn't changed. The whole of b was copied too, rather than just the elements that needed to change. That's important to avoid for large data, and it is why := and set() were introduced to data.table.
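For a quicker check than .Internal(inspect()), data.table also exports address(), which returns an object's memory address as a string. The sketch below is not part of the original answer; it reuses the DT and newDT names for illustration and assumes only that data.table is installed. It shows that a plain <- shares the object, while a base R subassignment leaves newDT pointing at a different object.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT                      # plain assignment: both names share one object

# Compare the addresses of the two bindings before any modification
same_before <- identical(address(DT), address(newDT))

newDT$b[2] <- 200                # base R subassignment: newDT becomes a copy

# Compare again: the bindings now refer to different objects
same_after <- identical(address(DT), address(newDT))

print(same_before)   # TRUE
print(same_after)    # FALSE
print(DT$b)          # 11 12, so DT was left untouched
```

Comparing addresses this way is a convenient guard in tests when it matters whether two names still share one table.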

Now, with our copied newDT, we can modify it by reference:

newDT
#      a   b
# [1,] 1  11
# [2,] 2 200

newDT[2, b := 400]
#      a   b        # See FAQ 2.21 for why this prints newDT
# [1,] 1  11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB:  # ..snip ..

Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference, with no copies at all.
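The same update can be written with set(), the functional form of :=. The following is a minimal sketch rather than the answer's own code: it builds a fresh newDT with the same values and uses data.table's address() to confirm that the table object is the same before and after the update.

```r
library(data.table)

newDT <- data.table(a = c(1, 2), b = c(11, 200))
addr_before <- address(newDT)

# set() assigns by reference: i is the row index, j the column name
set(newDT, i = 2L, j = "b", value = 400)

addr_after <- address(newDT)

print(identical(addr_before, addr_after))  # TRUE: same object, updated in place
print(newDT$b)                             # 11 400
```

set() avoids the overhead of [.data.table, which can matter when the update runs inside a loop.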

Or, we can modify the original DT by reference:

DT[2, b := 600]
#      a   b
# [1,] 1  11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
#   ATTRIB:  # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and a comparison to data.frame.

By the way, if you call tracemem(DT) and then DT[2, b := 600], you'll see one copy reported. That is a copy of the first 10 rows made by the print method. When the call is wrapped with invisible(), or made within a function or script, the print method isn't called.

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, call x = copy(x) at the start of the function. But remember that data.table is for large data (as well as offering faster programming for small data). We deliberately don't want to copy large objects (ever). As a result, we don't need to allow for the usual rule of thumb of a 3x working memory factor. We try to need only as much working memory as one column (i.e. a working memory factor of 1/ncol rather than 3).
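A minimal sketch of the two function styles described above. The function names add_flag and add_flag_local and the flag column are hypothetical, chosen only for illustration; the point is that := inside a function changes the caller's table unless the function takes a copy first.

```r
library(data.table)

# Hypothetical helper: modifies the caller's table by reference
add_flag <- function(x) {
  x[, flag := TRUE]
  invisible(x)
}

# Hypothetical helper: copies first, so the caller's table is untouched
add_flag_local <- function(x) {
  x <- copy(x)
  x[, flag := TRUE]
  x
}

DT <- data.table(a = c(1, 2), b = c(11, 12))

res <- add_flag_local(DT)
flag_after_local <- "flag" %in% names(DT)   # FALSE: only res has the column

add_flag(DT)
flag_after_ref <- "flag" %in% names(DT)     # TRUE: DT itself was modified

print(flag_after_local)
print(flag_after_ref)
```

The copy(x) version costs a full copy of the table, so it is worth reserving for cases where the caller's data genuinely must stay unchanged.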
