"update by reference" vs shallow copy


Question

The function set and the expression := inside [.data.table allow the user to update data.tables by reference. How does this behavior differ from reassigning the result of an operation to the original data.frame?

keepcols<-function(DF,cols){
  eval.parent(substitute(DF<-DF[,cols,with=FALSE]))  
}
keeprows<-function(DF,i){
   eval.parent(substitute(DF<-DF[i,]))
}

Because the RHS in the expression <- is a shallow copy of the initial dataframe in recent versions of R, these functions seem pretty efficient. How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use? When is the difference most sizable?

Some (speed) benchmarks. The speed difference seems negligible when the dataset has only two variables, and it grows as more variables are added.

library(data.table)

# Long dataset
N=1e7; K=100
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),     
   v1 =  sample(5, N, TRUE)                                         
)
system.time(DT[,a_inplace:=mean(v1)])
#    user  system elapsed
#   0.060   0.013   0.077
system.time(DT[,a_inplace:=NULL])
#    user  system elapsed
#   0.044   0.010   0.060

system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
#    user  system elapsed
#   0.132   0.025   0.161
system.time(DT <- DT[,list(id1,v1)])
#    user  system elapsed
#   0.124   0.026   0.153


# Wide dataset
N=1e7; K=100
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), 
   v1 =  sample(5, N, TRUE),                          
   v2 =  sample(1e6, N, TRUE),                        
   v3 =  sample(round(runif(100,max=100),4), N, TRUE)                    
)
system.time(DT[,a_inplace:=mean(v1)])
#    user  system elapsed
#   0.057   0.014   0.089
system.time(DT[,a_inplace:=NULL])
#    user  system elapsed
#   0.038   0.009   0.061

system.time(DT <- DT[,c(.SD,a_usual=mean(v1)),.SDcols=names(DT)])
#    user  system elapsed
#   2.483   0.146   2.602
system.time(DT <- DT[,list(id1,id2,id3,v1,v2,v3)])
#    user  system elapsed
#   1.143   0.088   1.220

Recommended answer

In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy; it deep copied. Shallow copying was introduced in R 3.1.0.

It's a wordy/lengthy answer, but I think this answers your first two questions:

How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?

In base R v3.1.0+ when we do:

DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)

  1. From DF1 to DF2, both columns are only shallow copied.
  2. From DF2 to DF3 the column y alone had to be copied/re-allocated, but x gets shallow copied again.
  3. From DF2 to DF4, same as (2).

That is, columns are shallow copied as long as the column remains unchanged - in a way, the copy is being delayed unless absolutely necessary.
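
One way to observe this is to compare the memory addresses of the columns. The sketch below assumes R 3.1.0 or later and uses address() from data.table purely as an inspection helper; the flag names (same_x_12 and so on) are made up for illustration.

```r
library(data.table)  # used here only for address()

DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]
DF3 <- transform(DF2, y = ifelse(y >= 8L, 1L, y))

# x and y in DF2 point at the same vectors as in DF1 (shallow copy)
same_x_12 <- identical(address(DF1$x), address(DF2$x))
same_y_12 <- identical(address(DF1$y), address(DF2$y))

# y was rewritten for DF3, so it lives in newly allocated memory
same_y_23 <- identical(address(DF2$y), address(DF3$y))

print(c(same_x_12 = same_x_12, same_y_12 = same_y_12, same_y_23 = same_y_23))
```

The first two flags should come out TRUE and the last one FALSE, which matches points (1) and (2) above.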

In data.table, we modify in place. That is, even in the steps that correspond to DF3 and DF4, column y doesn't get copied.

DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]

Here, since y is already an integer column, and we are modifying it by integer, by reference, there's no new memory allocation made here at all.

This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.
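
The in-place behaviour can be checked the same way, by comparing the address of the column before and after the sub-assignment. This is a minimal sketch; DT2 is built here from scratch, and addr_before / addr_after are illustrative names.

```r
library(data.table)

DT2 <- data.table(x = 1:5, y = 6:10)
addr_before <- address(DT2$y)

# Sub-assign by reference: integer values written into an integer column
DT2[y >= 8L, y := 1L]
addr_after <- address(DT2$y)

# The column vector is the same object; only some of its values changed
print(identical(addr_before, addr_after))
print(DT2$y)
```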

Another advantage that comes for free (that I came to know from our interactions) is, when we've to, say, convert all columns of a data.table to a numeric type, from say, character type:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for garbage collection. But if you were to do this using base R:

DF[] = lapply(DF, as.numeric)

All the columns will have to be converted to numeric, held in a temporary variable, and only then assigned back to DF. That means that if you have 10 columns and 100 million rows, each of character type, then your DF takes up:

10 * 100e6 * 4 / 1024^3 = ~ 3.7GB

And since the numeric type takes twice as much space, we'll need a total of 7.4GB + 3.7GB to make the conversion using base R.
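
The two conversion styles and the size estimate can be tried on toy data. This is only a sketch: the column names a and b are invented, and the 4 bytes per character element is the figure used in the text above, not something measured here.

```r
library(data.table)

cols <- c("a", "b")
DT <- data.table(a = c("1", "2", "3"), b = c("4.5", "5.5", "6.5"))
DF <- as.data.frame(DT)  # separate object for the base R variant

# data.table: the columns are replaced inside DT; DT itself is not copied
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# base R: the full converted list exists before it is assigned back to DF
DF[] <- lapply(DF, as.numeric)

# Back-of-the-envelope sizes, using the per-element sizes from the text
n_cols <- 10
n_rows <- 100e6
char_gb <- n_cols * n_rows * 4 / 1024^3  # character columns, 4 bytes each
num_gb  <- n_cols * n_rows * 8 / 1024^3  # numeric columns, 8 bytes each
print(c(character = char_gb, numeric = num_gb, both = char_gb + num_gb))
```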

But note that data.table does copy in the step that corresponds to going from DF1 to DF2. That is:

DT2 = DT1[, c("x", "y")]

results in a copy, because we can't sub-assign by reference on a shallow copy: doing so would update all the clones.
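
A small check of this, assuming a recent data.table version in which a character vector in j selects columns. The names shared_y and dt1_untouched are illustrative.

```r
library(data.table)

DT1 <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT2 <- DT1[, c("x", "y")]

# The column vectors are distinct objects, so DT2 is a real copy
shared_y <- identical(address(DT1$y), address(DT2$y))

# Sub-assigning by reference on DT2 therefore leaves DT1 alone
DT2[y >= 8L, y := 1L]
dt1_untouched <- identical(DT1$y, 6:10)

print(c(shared_y = shared_y, dt1_untouched = dt1_untouched))
```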

What would be great is if we could seamlessly integrate the shallow copy feature, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.

On to the last question:

"When is the difference most sizable?"

  1. There are still people who have to use older versions of R, where deep copies can't be avoided.

  2. It depends on how many columns are copied because of the operations you perform on the data. The worst case, of course, is that every column gets copied.

  3. There are cases like this where shallow copying won't help.

  4. When you'd like to update columns of a data.frame for each group, and there are too many groups.

  5. When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2. This can be done as:

DT1[DT2, col := i.val]

where the i. prefix refers to the val column of DT2 (the i argument) for the matching rows. This syntax performs the operation very efficiently, without first materializing the entire join result and then updating the required column.
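
The by-group and join cases can be sketched on toy data. The tables and the column names g, v, g_mean and id are invented for illustration, and the join key is passed through the on= argument.

```r
library(data.table)

# Update a column by reference for each group
DT <- data.table(g = c("a", "a", "b"), v = c(1, 3, 10))
DT[, g_mean := mean(v), by = g]

# Update a column of DT1 by reference, based on a join with DT2
DT1 <- data.table(id = c("a", "b", "c"), col = NA_real_)
DT2 <- data.table(id = c("a", "c"), val = c(10, 30))
DT1[DT2, col := i.val, on = "id"]

print(DT)
print(DT1)
```

Rows of DT1 with no match in DT2 (here id "b") keep their existing value of col.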

All in all, there are many cases where updating by reference saves a lot of time. But people sometimes prefer not to update objects in place, and are willing to sacrifice speed/memory for it. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.

Hope this helps. This is already quite a lengthy answer. I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).
