"update by reference" vs shallow copy
Question
The function set or the expression := inside [.data.table allows users to update data.tables by reference. How does this behavior differ from reassigning the result of an operation to the original data.frame?
keepcols<-function(DF,cols){
eval.parent(substitute(DF<-DF[,cols,with=FALSE]))
}
keeprows<-function(DF,i){
eval.parent(substitute(DF<-DF[i,]))
}
Because the RHS of the <- expression is a shallow copy of the initial data frame in recent versions of R, these functions seem pretty efficient. How is this base R method different from the data.table equivalent? Is the difference related only to speed or also to memory use? When is the difference most sizable?
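As a usage sketch of the two helpers above: with = FALSE is a data.table argument that [.data.frame does not accept, so this assumes the objects are data.tables. The table DT and its columns x, y, z are made up for illustration.

```r
library(data.table)

# Helpers from the question: substitute() splices the caller's expressions into
# the assignment, and eval.parent() runs that assignment in the caller's frame.
keepcols <- function(DF, cols) {
  eval.parent(substitute(DF <- DF[, cols, with = FALSE]))
}
keeprows <- function(DF, i) {
  eval.parent(substitute(DF <- DF[i, ]))
}

# Hypothetical example data
DT <- data.table(x = 1:6, y = letters[1:6], z = runif(6))

keepcols(DT, c("x", "y"))  # rebinds DT to a table holding only x and y
keeprows(DT, x > 3L)       # rebinds DT to the rows where x > 3

print(DT)
```

Both helpers rebind the name DT in the calling frame to a new object. They do not modify the original table in place.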
Some (speed) benchmarks follow. The speed difference seems negligible when the dataset has only two variables, and gets bigger with more variables.
library(data.table)
# Long dataset
N=1e7; K=100
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE),
v1 = sample(5, N, TRUE)
)
system.time(DT[,a_inplace:=mean(v1)])
user system elapsed
0.060 0.013 0.077
system.time(DT[,a_inplace:=NULL])
user system elapsed
0.044 0.010 0.060
system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
user system elapsed
0.132 0.025 0.161
system.time(DT <- DT[,list(id1,v1)])
user system elapsed
0.124 0.026 0.153
# Wide dataset
N=1e7; K=100
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE),
id2 = sample(sprintf("id%03d",1:K), N, TRUE),
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),
v1 = sample(5, N, TRUE),
v2 = sample(1e6, N, TRUE),
v3 = sample(round(runif(100,max=100),4), N, TRUE)
)
system.time(DT[,a_inplace:=mean(v1)])
user system elapsed
0.057 0.014 0.089
system.time(DT[,a_inplace:=NULL])
user system elapsed
0.038 0.009 0.061
system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
user system elapsed
2.483 0.146 2.602
system.time(DT <- DT[,list(id1,id2,id3,v1,v2,v3)])
user system elapsed
1.143 0.088 1.220
Recommended answer
In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy, but deep copied. Shallow copying was introduced in R 3.1.0.
It's a wordy/lengthy answer, but I think it answers your first two questions:
How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?
In base R v3.1.0+, when we do:
DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
1. From DF1 to DF2, both columns are only shallow copied.
2. From DF2 to DF3, the column y alone had to be copied/re-allocated, but x gets shallow copied again.
3. From DF2 to DF4, same as (2).
That is, columns are shallow copied as long as the column remains unchanged. In a way, the copy is delayed unless absolutely necessary.
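One way to check this is to compare the memory addresses of the column vectors. The sketch below uses address() from data.table purely as an inspection tool; that choice, and the variable names shared_*, are assumptions of this example and not part of the original answer. If x is shallow copied as described, the comparisons for x should come out TRUE, while y in DF4 must be a different vector.

```r
library(data.table)  # used here only for address()

DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]
DF4 <- transform(DF2, y = 2L)

# Equal addresses mean both data frames point at the same column vector,
# i.e. the column was shallow copied.
shared_x_1_2 <- address(DF1$x) == address(DF2$x)
shared_x_2_4 <- address(DF2$x) == address(DF4$x)

# y was replaced in DF4, so it has to live in a different vector.
shared_y_2_4 <- address(DF2$y) == address(DF4$y)

print(c(x_DF1_DF2 = shared_x_1_2,
        x_DF2_DF4 = shared_x_2_4,
        y_DF2_DF4 = shared_y_2_4))
```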
In data.table, we modify in place, meaning that even for the DF3 and DF4 steps, column y doesn't get copied.
DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]
Here, since y is already an integer column and we are modifying it with an integer value, by reference, no new memory is allocated at all.
This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.
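A minimal sketch of the sub-assignment case, again using address() to look at the column vector before and after. The table contents here are hypothetical. If the update really happens in place, the address of y should be unchanged.

```r
library(data.table)

# Hypothetical table with two integer columns
DT2 <- data.table(x = c(1L, 2L, 3L, 4L, 5L),
                  y = c(6L, 7L, 8L, 9L, 10L))

addr_before <- address(DT2$y)   # address of the y column vector

DT2[y >= 8L, y := 1L]           # sub-assign by reference

addr_after <- address(DT2$y)
same_vector <- identical(addr_before, addr_after)

print(DT2$y)
print(same_vector)
```

Note that no assignment with <- is needed: the := call changes DT2 itself.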
Another advantage that comes for free (that I came to know from our interactions) shows up when we have to, say, convert all columns of a data.table from character type to numeric type:
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for garbage collection. But if you were to do this using base R:
DF[] = lapply(DF, as.numeric)
All the columns have to be converted to numeric and held in a temporary variable, and only then assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes a space of:
10 * 100e6 * 4 / 1024^3 = ~ 3.7GB
And since the numeric type is twice as large, we'll need a total of 7.4GB + 3.7GB of space to make the conversion using base R.
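The arithmetic can be reproduced as follows. The 4 bytes per character element is the figure used in the text and is treated here as an assumption: it counts only the per-element pointer, not the strings themselves, and on a 64-bit build that pointer is 8 bytes, so the character side would be larger there. The variable names are made up for this sketch.

```r
n_cols <- 10
n_rows <- 100e6

bytes_per_character_elt <- 4  # figure from the text (assumption, see above)
bytes_per_numeric_elt   <- 8  # a double takes 8 bytes

to_gib <- function(bytes) bytes / 1024^3

character_gib <- to_gib(n_cols * n_rows * bytes_per_character_elt)
numeric_gib   <- to_gib(n_cols * n_rows * bytes_per_numeric_elt)

# Base R holds the old character columns and the new numeric columns at once;
# converting column by column by reference only needs one extra column at a time.
peak_base_gib     <- character_gib + numeric_gib
extra_per_col_gib <- numeric_gib / n_cols

print(c(character = character_gib, numeric = numeric_gib,
        peak_base = peak_base_gib, extra_per_col = extra_per_col_gib))
```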
But note that data.table copies when going from DF1 to DF2. That is:
DT2 = DT1[, c("x", "y")]
results in a copy, because we can't sub-assign by reference on a shallow copy: doing so would update all the clones.
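The consequence of that copy can be sketched like this (hypothetical data; address() is again used only for inspection). Because DT2 owns its own column vectors, a later := on DT2 leaves DT1 untouched.

```r
library(data.table)

DT1 <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT2 <- DT1[, c("x", "y")]       # column subset: the columns are copied

# Expected to be FALSE, since DT2 holds its own copy of x
shares_x <- identical(address(DT1$x), address(DT2$x))

DT2[, y := 2L]                  # by-reference update of DT2 only

print(shares_x)
print(DT1$y)                    # DT1 keeps its original y
print(DT2$y)
```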
What would be great is if we could seamlessly integrate the shallow copy feature, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.
On the last question:
"When is the difference most sizeable?"
- There are still people who have to use older versions of R, where deep copies can't be avoided.
- It depends on how many columns are being copied because of the operations you perform on them. The worst case, of course, is that you've copied all the columns.
- There are cases like this where shallow copying won't benefit.
- When you'd like to update columns of a data.frame for each group, and there are too many groups.
- When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2. This can be done as:
DT1[DT2, col := i.val]
where i. refers to the value from the val column of DT2 (the i argument) for matching rows. This syntax performs the operation very efficiently, instead of having to first materialize the entire join result and then update the required column.
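The last two cases can be sketched on toy data. The tables, the id and g columns, and the use of the on = argument (in place of keyed tables) are assumptions made for this example.

```r
library(data.table)

# Update join: fill DT1$col from DT2$val for matching ids, by reference
DT1 <- data.table(id = c("a", "b", "c"), col = c(0, 0, 0))
DT2 <- data.table(id = c("a", "c"), val = c(10, 30))
DT1[DT2, col := i.val, on = "id"]   # rows of DT1 without a match stay as they are

# Grouped update: add a per-group mean as a new column, by reference
DT3 <- data.table(g = c("a", "a", "b"), v = c(1, 3, 5))
DT3[, v_mean := mean(v), by = g]

print(DT1)
print(DT3)
```

In both calls only the target column is written. No intermediate joined or aggregated table is assigned back to the original name.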
All in all, there are strong arguments that update by reference saves a lot of time and memory. But people sometimes prefer not to update objects in place, and are willing to sacrifice speed/memory for it. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.
Hope this helps. This is already quite a lengthy answer. I'll leave any remaining questions to others, or for you to figure out (other than any obvious misconceptions in this answer).