Understanding exactly when a data.table is a reference to (vs a copy of) another data.table
Problem description
I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to understand exactly what's happening.
On creating a data.table from another data.table (via <-), then updating the new table by :=, the original table is also altered. This is expected, as per:
?data.table::copy
and stackoverflow: pass-by-reference-the-operator-in-the-data-table-package
Here's an example:
library(data.table)
DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
# a b
# [1,] 1 11
# [2,] 2 12
newDT <- DT # reference, not copy
newDT[1, a := 100] # modify new DT
print(DT) # DT is modified too.
# a b
# [1,] 100 11
# [2,] 2 12
However, if I insert a non-:= based modification between the <- assignment and the := lines above, DT is now no longer modified:
DT = data.table(a=c(1,2), b=c(11,12))
newDT <- DT
newDT$b[2] <- 200 # new operation
newDT[1, a := 100]
print(DT)
# a b
# [1,] 1 11
# [2,] 2 12
So it seems that the newDT$b[2] <- 200 line somehow 'breaks' the reference. I'd guess that this invokes a copy somehow, but I would like to understand fully how R is treating these operations, to ensure I don't introduce potential bugs in my code.
I'd very much appreciate if someone could explain this to me.
Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.
DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"
newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..
Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.
Now, with our copied newDT we can modify it by reference:
newDT
# a b
# [1,] 1 11
# [2,] 2 200
newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip..
Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
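The same check can be done programmatically without reading inspect() output. This is a minimal sketch, assuming data.table's exported address() helper is available in your version; the object name X is hypothetical and kept separate from DT and newDT so it does not disturb the walk-through.

```r
library(data.table)

# X is a hypothetical table used only for this check
X <- data.table(a = c(1, 2), b = c(11, 12))

# Record the address of the table and of column b before modifying
tbl_before <- address(X)
col_before <- address(X$b)

# Both of these assign by reference
set(X, i = 2L, j = "b", value = 400)
X[1, a := 100]

# Compare addresses after the updates
tbl_same <- identical(address(X), tbl_before)
col_same <- identical(address(X$b), col_before)
```

If both comparisons come out TRUE, neither the table nor the b column was reallocated by the two updates.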
Or, we can modify the original DT by reference:
DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..
Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.
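To make the alias-versus-copy distinction explicit in your own code, a sketch along these lines may help. The names orig, alias and snapshot are hypothetical; copy() and address() are the data.table functions.

```r
library(data.table)

orig <- data.table(a = c(1, 2), b = c(11, 12))

alias    <- orig        # plain <- binds a second name to the same object
snapshot <- copy(orig)  # explicit, independent copy

# Update through the alias by reference
alias[1, a := 100]

# The original sees the change, the snapshot does not
orig_a     <- orig$a[1]
snapshot_a <- snapshot$a[1]

# Address comparison confirms which names share an object
alias_shared    <- identical(address(alias), address(orig))
snapshot_shared <- identical(address(snapshot), address(orig))
```

Taking the copy() up front is one way to avoid depending on whether some later base-R subassignment happens to trigger a copy.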
Btw, if you call tracemem(DT) and then DT[2, b := 600], you'll see one copy reported. That is the copy of the first 10 rows that the print method makes. When wrapped with invisible() or when called within a function or script, the print method isn't called.
All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x = copy(x) at the start of the function. But remember that data.table is for large data (as well as offering faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3x working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
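The behaviour inside functions can be sketched as follows. Both helper functions and the flag column are hypothetical names made up for illustration.

```r
library(data.table)

# Hypothetical helper: := inside the function changes the caller's table
add_flag_inplace <- function(x) {
  x[, flag := TRUE]
  invisible(x)
}

# Hypothetical helper: takes a local copy first, so the caller's table is untouched
add_flag_copy <- function(x) {
  x <- copy(x)
  x[, flag := TRUE]
  x
}

DT <- data.table(a = c(1, 2), b = c(11, 12))

res <- add_flag_copy(DT)
after_copy_version <- "flag" %in% names(DT)     # caller's table after the copying helper

add_flag_inplace(DT)
after_inplace_version <- "flag" %in% names(DT)  # caller's table after the in-place helper
```

Only the in-place helper should add the flag column to the caller's DT; the copying helper returns a separate table that carries it.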