Assigning by reference into loaded package datasets


Problem description

    I am in the process of creating a package that uses a data.table as a dataset and has a couple of functions which assign by reference using :=.

    I have built a simple package to demonstrate my problem

    library(devtools)
    install_github('mnel/foo')  # older devtools versions: install_github('foo', 'mnel')
    

    It contains two functions

    foo <- function(x){
      x[, a := 1]
    }
    fooCall <- function(x){
      eval(substitute(x[, a := 1]), parent.frame(1))
    }
    

    and a dataset (not lazy loaded) DT, created using

    DT <- data.table(b = 1:5)
    save(DT, file = 'data/DT.rda')
    

    When I install this package, my understanding is that foo(DT) should assign by reference within DT.

     library(foo)
     data(DT)
     foo(DT)
       b a
    1: 1 1
    2: 2 1
    3: 3 1
    4: 4 1
    5: 5 1
    
    # However this has not assigned by reference within `DT`
    
    DT
       b
    1: 1
    2: 2
    3: 3
    4: 4
    5: 5
    

    If I use the more correct

    tracemem(DT)
    DT <- foo(DT)
    # This works without copying
    DT
       b a
    1: 1 1
    2: 2 1
    3: 3 1
    4: 4 1
    5: 5 1
    untracemem(DT)
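
The reassignment pattern above can be tried without installing the package. This is a minimal sketch: it assumes data.table is installed, and it uses a serialize() round trip as a stand-in for a dataset loaded from disk (the same trick the answer below uses).

```r
library(data.table)

# Same body as foo() in the question.
foo <- function(x) {
  x[, a := 1]
}

# Round trip through serialize() to mimic a data.table loaded from disk.
DT <- unserialize(serialize(data.table(b = 1:5), NULL))

# Rebinding the name to the return value works whether or not `:=`
# had to take a shallow copy inside foo().
DT <- foo(DT)
names(DT)
```

Because the result is assigned back to DT, this form does not depend on the update being visible through the original binding.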
    

    If I use eval and substitute within the function

    fooCall(DT)
       b a
    1: 1 1
    2: 2 1
    3: 3 1
    4: 4 1
    5: 5 1
    # it does assign by reference 
    DT
       b a
    1: 1 1
    2: 2 1
    3: 3 1
    4: 4 1
    5: 5 1
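
One way to see what fooCall() does differently is to look at the call it builds before evaluating it. The helper showCall() below is hypothetical (it is not part of the package); it returns the substituted call instead of passing it to eval().

```r
# Hypothetical helper: returns the call that fooCall() would evaluate,
# instead of evaluating it.
showCall <- function(x) {
  substitute(x[, a := 1])
}

DT <- data.frame(b = 1:5)  # any object works here; the call is never evaluated
showCall(DT)               # the call refers to `DT`, not to the local `x`
```

Since fooCall() evaluates that call in parent.frame(1), the `[` method operates on the caller's variable DT rather than on a function-local x, which appears to be why the new column shows up in the caller's DT.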
    

    Should I

    1. stick with DT <- foo(DT) or the eval/substitute route, or
    2. is there something I'm not understanding about how data() loads datasets, even when they are not lazy-loaded?

    Solution

    This has nothing to do with datasets or locking -- you can reproduce it simply using

    DT<-unserialize(serialize(data.table(b = 1:5),NULL))
    foo(DT)
    DT
    

    I suspect the cause is that data.table has to re-create the external pointer (extptr) stored inside the object the first time DT is accessed after loading, but it does so on a copy, so the modification cannot be shared with the original in the global environment.
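
The lost state can be inspected with truelength(), which data.table exports to report how many column slots are allocated. This is a sketch under the assumption that the serialize() round trip behaves like loading from disk; the exact numbers depend on the data.table version and on the datatable.alloccol option, so none are shown.

```r
library(data.table)

fresh <- data.table(b = 1:5)
truelength(fresh)      # larger than ncol(fresh): spare column slots exist

restored <- unserialize(serialize(fresh, NULL))
truelength(restored)   # typically 0: the over-allocation did not survive

# Re-allocate the spare slots; the result is assigned back explicitly.
restored <- alloc.col(restored)
truelength(restored)   # spare slots are available again
```

Newer data.table releases also provide setalloccol() for the same purpose; alloc.col() is used here because it is the function the answer refers to.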


    [From Matthew] Exactly.

    DT<-unserialize(serialize(data.table(b = 1:3),NULL))
    DT
       b
    1: 1
    2: 2
    3: 3
    DT[,newcol:=42]
    DT                 # Ok. DT rebound to new shallow copy (when direct)
       b newcol
    1: 1     42
    2: 2     42
    3: 3     42
    
    DT<-unserialize(serialize(data.table(b = 1:3),NULL))
    foo(DT)
       b a
    1: 1 1
    2: 2 1
    3: 3 1
    DT                 # but not ok when via function foo()
       b
    1: 1
    2: 2
    3: 3
    


    DT<-unserialize(serialize(data.table(b = 1:3),NULL))
    alloc.col(DT)      # alloc.col needed first
       b
    1: 1
    2: 2
    3: 3
    foo(DT)
       b a
    1: 1 1
    2: 2 1
    3: 3 1
    DT                 # now it's ok
       b a
    1: 1 1
    2: 2 1
    3: 3 1
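
In a package, the alloc.col() step could be folded into whatever code loads the dataset, so callers never see the unallocated object. The sketch below is one possible arrangement, not the package's actual code: loadDT() is a hypothetical helper, and a temporary .rda file stands in for data/DT.rda.

```r
library(data.table)

foo <- function(x) {
  x[, a := 1]
}

# Hypothetical helper: load an .rda file that contains an object named
# `DT`, and over-allocate it before handing it to the caller.
loadDT <- function(path) {
  env <- new.env()
  load(path, envir = env)
  DT <- env$DT
  DT <- alloc.col(DT)   # restore the spare column slots `:=` relies on
  DT
}

# Stand-in for data/DT.rda, written to a temporary file.
path <- tempfile(fileext = ".rda")
DT <- data.table(b = 1:5)
save(DT, file = path)
rm(DT)

DT <- loadDT(path)
foo(DT)      # adds column `a` by reference
names(DT)
```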
    

    Or, don't pass DT into the function; just refer to it directly. Use data.table like a database: a few fixed-name tables in .GlobalEnv.

    DT <- unserialize(serialize(data.table(b = 1:5),NULL))
    foo <- function() {
       DT[, newcol := 7]
    }
    foo()
       b newcol
    1: 1      7
    2: 2      7
    3: 3      7
    4: 4      7
    5: 5      7
    DT              # Unserialized data.table now over-allocated and updated ok.
       b newcol
    1: 1      7
    2: 2      7
    3: 3      7
    4: 4      7
    5: 5      7
    
