"Update by reference" vs shallow copy

Question

The function set, or the expression := inside [.data.table, allows users to update data.tables by reference. How does this behavior differ from reassigning the result of an operation to the original data.frame?

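# Keep only the columns named in `cols`, rebinding the result to the caller's variable.
keepcols <- function(DF, cols) {
  eval.parent(substitute(DF <- DF[, cols, with = FALSE]))
}
# Keep only the rows selected by `i`, rebinding the result to the caller's variable.
keeprows <- function(DF, i) {
  eval.parent(substitute(DF <- DF[i, ]))
}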

Because the RHS of the <- expression is a shallow copy of the initial data frame in recent versions of R, these functions seem pretty efficient. How is this base R method different from the data.table equivalent? Is the difference related only to speed, or also to memory use? When is the difference most sizable?

Some (speed) benchmarks. The speed difference seems negligible when the dataset has only two variables, and gets bigger with more variables.

library(data.table)

# Long dataset
N=1e7; K=100
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),     
   v1 =  sample(5, N, TRUE)                                         
)
system.time(DT[,a_inplace:=mean(v1)])
 user  system elapsed 
 0.060   0.013   0.077 
system.time(DT[,a_inplace:=NULL])
 user  system elapsed 
0.044   0.010   0.060 


system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
user  system elapsed 
0.132   0.025   0.161  
system.time(DT <- DT[,list(id1,v1)])
user  system elapsed 
0.124   0.026   0.153 


# Wide dataset
N=1e7; K=100
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), 
   v1 =  sample(5, N, TRUE),                          
   v2 =  sample(1e6, N, TRUE),                        
   v3 =  sample(round(runif(100,max=100),4), N, TRUE)                    
)
system.time(DT[,a_inplace:=mean(v1)])
 user  system elapsed 
  0.057   0.014   0.089 
system.time(DT[,a_inplace:=NULL])
 user  system elapsed 
  0.038   0.009   0.061 

system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
user  system elapsed 
2.483   0.146   2.602 
system.time(DT <- DT[,list(id1,id2,id3,v1,v2,v3)])
 user  system elapsed 
 1.143   0.088   1.220 


Recommended answer

In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy; it deep copied. Shallow copying was introduced in R 3.1.0.
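A minimal sketch of what "by reference" means in practice, assuming the data.table package is installed. The object names (DT, alias) are illustrative and not from the original post; address() is data.table's helper for inspecting where an object lives in memory.

```r
library(data.table)

DT <- data.table(x = 1:5)
alias <- DT                 # a second name for the same object; no copy is made
addr_before <- address(DT)  # memory address of the data.table

DT[, y := x * 2L]           # add a column by reference

# The object was modified in place, so the address is unchanged
# and the other name sees the new column as well.
same_object <- identical(addr_before, address(DT))
alias_sees_y <- "y" %in% names(alias)

# Reassigning the result of a subset binds DT to a new object instead;
# alias still refers to the old one and keeps both columns.
DT <- DT[, .(x)]
rebound <- !identical(addr_before, address(DT))
```

If an independent object is wanted before modifying by reference, copy(DT) is the usual way to get one.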

It's a wordy/lengthy answer, but I think this answers your first two questions:


How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?

In base R v3.1.0+ when we do:

DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)




  1. From DF1 to DF2, both columns are only shallow copied.
  2. From DF2 to DF3 the column y alone had to be copied/re-allocated, but x gets shallow copied again.
  3. From DF2 to DF4, same as (2).

That is, columns are shallow copied as long as the column remains unchanged - in a way, the copy is being delayed unless absolutely necessary.
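The sharing described above can be checked by comparing column addresses. This is a sketch under the assumption of R 3.1.0 or later, and it borrows address() from data.table purely as an inspection tool; the variable names holding the results are illustrative.

```r
library(data.table)  # used only for address(); assumed to be installed

DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]
DF3 <- transform(DF2, y = ifelse(y >= 8L, 1L, y))

# In R >= 3.1.0 the untouched columns are expected to be shared, not duplicated.
x_shared_1_2 <- identical(address(DF1$x), address(DF2$x))
x_shared_2_3 <- identical(address(DF2$x), address(DF3$x))

# Column y was replaced in DF3, so it lives in newly allocated memory.
y_copied <- !identical(address(DF2$y), address(DF3$y))
```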

In data.table, we modify in-place. That means column y doesn't get copied even in the steps that correspond to DF3 and DF4:

DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]

Here, since y is already an integer column, and we are modifying it by integer, by reference, there's no new memory allocation made here at all.
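A small runnable check of the sub-assignment marked (a), assuming data.table is installed. DT2 is built here from scratch for illustration, since the original post does not show how it was created.

```r
library(data.table)

DT2 <- data.table(x = 1:5, y = 6:10)
y_before <- address(DT2$y)   # where column y lives before the update

DT2[y >= 8L, y := 1L]        # sub-assign by reference, integer into integer

# The column's memory is reused: same address, new values.
y_reused <- identical(y_before, address(DT2$y))
```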

This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.

Another advantage that comes for free (that I came to know from our interactions) shows up when we have to, say, convert all columns of a data.table to a numeric type from, say, character type:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for garbage collection. But if you were to do this using base R:

DF[] = lapply(DF, as.numeric)

All the columns have to be converted to numeric and held in a temporary variable, and only then are they assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes a space of:

10 * 100e6 * 4 / 1024^3 = ~ 3.7GB

And since numeric type is twice as much in size, we'll need a total of 7.4GB + 3.7GB of space for us to make the conversion using base R.
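The back-of-the-envelope figures can be reproduced as follows. The 4 bytes per element for the original columns is the assumption used in the text above, and the helper gb() is a hypothetical name introduced only for this sketch.

```r
n_cols <- 10
n_rows <- 100e6

# Assumption carried over from the text: 4 bytes per element for the
# original columns, and 8 bytes per element for numeric (double) columns.
gb <- function(bytes_per_element) n_cols * n_rows * bytes_per_element / 1024^3

original_gb <- gb(4)   # roughly 3.7
numeric_gb  <- gb(8)   # twice the original, roughly 7.45
peak_gb     <- original_gb + numeric_gb  # both coexist until DF is reassigned
```

With the by-reference version, only one converted column needs to exist alongside the original data at any moment, which is where the saving comes from.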

But note that data.table does copy in the step that corresponds to going from DF1 to DF2. That is:

DT2 = DT1[, c("x", "y")]

results in a copy, because we can't sub-assign by reference on a shallow copy; doing so would update all the clones.
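A short sketch of that copy, assuming data.table is installed. DT1 mirrors DF1 from the earlier base R example; the result variable names are illustrative.

```r
library(data.table)

DT1 <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT2 <- DT1[, c("x", "y")]

# The selected columns are copied, so DT2 does not share memory with DT1.
x_is_copy <- !identical(address(DT1$x), address(DT2$x))

DT2[, x := 0L]                           # modify DT2 in place
dt1_untouched <- identical(DT1$x, 1:5)   # DT1 keeps its original values
```

Because the columns were copied up front, the later := on DT2 is safe and cannot leak into DT1.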

What would be great is if we could seamlessly integrate the shallow copy feature, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.

And the final question:

"When is the difference most sizeable?"




  1. There are still people who have to use older versions of R, where deep copies can't be avoided.

  2. It depends on how many columns are being copied because of the operations you perform on them. The worst case, of course, is that every column gets copied.

  3. There are cases where shallow copying won't help.

  4. When you'd like to update columns of a data.frame for each group, and there are too many groups.

  5. When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2 - this can be done as:

DT1[DT2, col := i.val]

where i.val refers to the value in the val column of DT2 (the i argument) for the matching rows. This syntax performs the update very efficiently, without first having to materialize the entire join result and then update the required column.
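A self-contained sketch of the update join, together with a grouped update by reference, assuming data.table 1.9.6 or later for the on= argument. The tables and the column names id, g, v, and g_mean are made up for illustration.

```r
library(data.table)

DT1 <- data.table(id = c("a", "b", "c"), col = c(1, 2, 3))
DT2 <- data.table(id = c("a", "c"), val = c(10, 30))

# Update join: for rows of DT1 that match DT2 on id, overwrite col with DT2's val.
# Unmatched rows ("b") are left as they were.
DT1[DT2, col := i.val, on = "id"]

# Grouped update by reference: add each group's mean as a new column.
DT3 <- data.table(g = c("u", "u", "v"), v = c(1, 3, 5))
DT3[, g_mean := mean(v), by = g]
```

In both cases only the target column is written; no intermediate joined or aggregated table is assigned back.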

All in all, there are strong arguments for cases where update by reference saves a lot of time. But people sometimes like to not update objects in-place, and are willing to sacrifice speed/memory for it. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.

Hope this helps. This is already quite a lengthy answer. I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).
