"update by reference" vs shallow copy
Question
The function set or the expression := inside [.data.table allows users to update data.tables by reference. How does this behavior differ from reassigning the result of an operation to the original data.frame?
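# Keep only the columns named in cols; the result is reassigned in the caller's frame
keepcols <- function(DF, cols) {
  eval.parent(substitute(DF <- DF[, cols, with = FALSE]))
}
# Keep only the rows selected by i; the result is reassigned in the caller's frame
keeprows <- function(DF, i) {
  eval.parent(substitute(DF <- DF[i, ]))
}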
Because the RHS of the <- expression is a shallow copy of the initial data frame in recent versions of R, these functions seem pretty efficient. How is this base R method different from the data.table equivalent? Is the difference related only to speed, or also to memory use? When is the difference most sizable?
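As a quick illustration of how the two helpers above behave, here is a minimal sketch on a small made-up table. The table toy and its columns are hypothetical, and the sketch assumes data.table is installed, since with = FALSE is a data.table argument.

```r
library(data.table)

keepcols <- function(DF, cols) {
  eval.parent(substitute(DF <- DF[, cols, with = FALSE]))
}
keeprows <- function(DF, i) {
  eval.parent(substitute(DF <- DF[i, ]))
}

# Hypothetical toy table, only for illustration
toy <- data.table(id = c("a", "b", "c", "d"),
                  v1 = 1:4,
                  v2 = c(0.5, 1.5, 2.5, 3.5))

# Evaluates toy <- toy[, c("id", "v1"), with = FALSE] in the calling frame
keepcols(toy, c("id", "v1"))

# Evaluates toy <- toy[v1 > 2L, ] in the calling frame
keeprows(toy, v1 > 2L)

print(toy)
```

Both helpers rebind the name toy to a new object rather than modifying the existing one, which is exactly the reassignment pattern the question asks about.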
Some (speed) benchmarks. The speed difference seems negligible when the dataset has only two variables, and it gets bigger with more variables.
library(data.table)

# Long dataset
N = 1e7; K = 100
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  v1  = sample(5, N, TRUE)
)
system.time(DT[, a_inplace := mean(v1)])
#   user  system elapsed
#  0.060   0.013   0.077
system.time(DT[, a_inplace := NULL])
#   user  system elapsed
#  0.044   0.010   0.060
system.time(DT <- DT[, c(.SD, a_usual = mean(v1)), .SDcols = names(DT)])
#   user  system elapsed
#  0.132   0.025   0.161
system.time(DT <- DT[, list(id1, v1)])
#   user  system elapsed
#  0.124   0.026   0.153

# Wide dataset
N = 1e7; K = 100
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),
  v1  = sample(5, N, TRUE),
  v2  = sample(1e6, N, TRUE),
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)
)
system.time(DT[, a_inplace := mean(v1)])
#   user  system elapsed
#  0.057   0.014   0.089
system.time(DT[, a_inplace := NULL])
#   user  system elapsed
#  0.038   0.009   0.061
system.time(DT <- DT[, c(.SD, a_usual = mean(v1)), .SDcols = names(DT)])
#   user  system elapsed
#  2.483   0.146   2.602
system.time(DT <- DT[, list(id1, id2, id3, v1, v2, v3)])
#   user  system elapsed
#  1.143   0.088   1.220
Recommended answer
In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy; it deep copied. Shallow copying was introduced in R 3.1.0.
It's a lengthy answer, but I think it addresses your first two questions:
How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?
In base R v3.1.0+, when we do:
DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
- From DF1 to DF2, both columns are only shallow copied.
- From DF2 to DF3, the column y alone had to be copied/re-allocated, but x gets shallow copied again.
- From DF2 to DF4, the same as from DF2 to DF3.
That is, a column is shallow copied as long as it remains unchanged. In a way, the copy is delayed until it is absolutely necessary.
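The shallow-copy behavior described above can be checked by comparing the memory addresses of the column vectors. This is a minimal sketch, assuming R 3.1.0 or later; it borrows address() from data.table purely as a convenient way to print a vector's address, and the variable names same_x, same_y and new_y are made up for the illustration.

```r
library(data.table)  # used here only for address()

DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]
DF3 <- transform(DF2, y = ifelse(y >= 8L, 1L, y))

# DF2 points at the same column vectors as DF1 (shallow copy)
same_x <- address(DF1$x) == address(DF2$x)
same_y <- address(DF1$y) == address(DF2$y)

# y was rewritten by transform(), so DF3 holds a newly allocated vector for it
new_y <- address(DF2$y) != address(DF3$y)

print(c(same_x = same_x, same_y = same_y, new_y = new_y))
```

On an R version older than 3.1.0 the first two comparisons would be expected to come out FALSE, because the columns were deep copied.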
In data.table, we modify in-place. That means even for the operations that produced DF3 and DF4, column y doesn't get copied.
DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]
Here, since y is already an integer column and we are modifying it with integer values, by reference, no new memory is allocated at all.
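A minimal sketch of the in-place behavior, assuming a small hypothetical DT2 with the same x and y columns as the data frames above. The column address is compared before and after the sub-assignment marked (a); addr_before, addr_after and y_after_subassign are names introduced only for this illustration.

```r
library(data.table)

# Hypothetical DT2 mirroring the x and y columns used above
DT2 <- data.table(x = c(1L, 2L, 3L, 4L, 5L),
                  y = c(6L, 7L, 8L, 9L, 10L))

addr_before <- address(DT2$y)

DT2[y >= 8L, y := 1L]              # (a) sub-assign by reference
addr_after <- address(DT2$y)
y_after_subassign <- copy(DT2$y)   # snapshot, since later updates are in-place

DT2[, y := 2L]                     # overwrite every row of y

# TRUE: the sub-assignment wrote into the existing vector
print(addr_before == addr_after)
print(DT2)
```

The snapshot is taken with copy() because a plain assignment such as y1 <- DT2$y would keep pointing at the same vector that the next := call modifies.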
This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). It is a handy feature we really like in data.table.
Another advantage that comes for free (one I came to know from our interactions) shows up when we have to convert all columns of a data.table from, say, character type to numeric type:
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for garbage collection. But if you were to do this using base R:
DF[] = lapply(DF, as.numeric)
all the columns have to be converted to numeric and held in a temporary variable, which is then finally assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes up:
10 * 100e6 * 4 / 1024^3 = ~ 3.7GB
And since the numeric type is twice as large, we'll need a total of 7.4GB + 3.7GB of space to make the conversion using base R.
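The two conversions and the size estimate can be tried on a small scale. This is only a sketch: the tables below are hypothetical stand-ins with three rows, and the arithmetic simply reproduces the figures quoted above (4 bytes per element for the character columns and twice that for numeric, as the answer assumes).

```r
library(data.table)

# Hypothetical small table with character columns, standing in for the large one
DT <- data.table(a = c("1", "2", "3"), b = c("4.5", "5.5", "6.5"))
cols <- names(DT)

# data.table: the columns of DT are replaced by reference
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# base R: the converted list is built first, then assigned back into DF
DF <- data.frame(a = c("1", "2", "3"), b = c("4.5", "5.5", "6.5"),
                 stringsAsFactors = FALSE)
DF[] <- lapply(DF, as.numeric)

# Size estimate from the text: 10 columns, 100 million rows
char_gb    <- 10 * 100e6 * 4 / 1024^3   # character columns, in GB
numeric_gb <- 2 * char_gb               # numeric columns, twice as large
total_gb   <- char_gb + numeric_gb      # both held at once in base R

print(c(char_gb = char_gb, numeric_gb = numeric_gb, total_gb = total_gb))
```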
But note that data.table does copy in the step corresponding to DF1 to DF2. That is:
DT2 = DT1[, c("x", "y")]
results in a copy, because we can't sub-assign by reference on a shallow copy: doing so would update all the clones.
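A small sketch of both halves of this point, using hypothetical tables. The first part checks that the column subset owns its own vectors. The second part is an added illustration in which plain assignment with <- stands in for a shallow copy: two names share the same columns, so an update by reference through one is visible through the other.

```r
library(data.table)

DT1 <- data.table(x = c(1L, 2L, 3L, 4L, 5L),
                  y = c(6L, 7L, 8L, 9L, 10L),
                  z = c(11L, 12L, 13L, 14L, 15L))

# Selecting columns copies them, so DT2 has its own vectors
DT2 <- DT1[, c("x", "y")]
cols_are_copies <- address(DT1$y) != address(DT2$y)

# Safe: this touches DT2 only
DT2[y >= 8L, y := 1L]

# Plain assignment does not copy; DT3 and DT1 are the same data.table
DT3 <- DT1
DT3[, z := 0L]   # DT1$z changes as well

print(cols_are_copies)
print(DT1)
```

When an independent object is wanted in the second case, copy(DT1) is the explicit way to get one.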
What would be great is if we could seamlessly integrate the shallow copy feature while keeping track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.
On to the last question:
When is the difference most sizable?
- There are still people who have to use older versions of R, where deep copies can't be avoided.
- It depends on how many columns are copied as a result of the operations you perform. The worst case, of course, is that every column gets copied.
- There are cases like this where shallow copying won't help.
- When you'd like to update columns of a data.frame for each group, and there are too many groups.
- When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2. This can be done as:
DT1[DT2, col := i.val]
where i. refers to the value from the val column of DT2 (the i argument) for matching rows. This syntax performs the operation very efficiently, instead of first materializing the entire join result and then updating the required column.
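The last two cases in the list above, the update on a join and the per-group update, can be sketched on toy data. The tables and the column names id, grp, col, val and grp_total are assumptions made for the example, and the join column is given explicitly with on = rather than through keys.

```r
library(data.table)

# Hypothetical tables; all column names here are made up for the example
DT1 <- data.table(id  = c("a", "b", "c"),
                  grp = c("g1", "g1", "g2"),
                  col = c(0, 0, 0))
DT2 <- data.table(id = c("a", "c"), val = c(10, 30))

# Update join: only rows of DT1 that match DT2 on id are modified, in place
DT1[DT2, col := i.val, on = "id"]

# Grouped update by reference: adds one column, filled group by group
DT1[, grp_total := sum(col), by = grp]

print(DT1)
```

Row "b" has no match in DT2, so its col value is left untouched, and no intermediate joined table is assigned back to DT1.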
All in all, there are strong arguments that updating by reference saves a lot of time and memory. But people sometimes prefer not to update objects in-place, and are willing to sacrifice speed/memory for it. We're trying to figure out how best to provide this functionality as well, in addition to the existing update by reference.
Hope this helps. This is already quite a lengthy answer. I'll leave any remaining questions to others, or for you to figure out (other than any obvious misconceptions in this answer).