R: Passing a data frame by reference

Question

R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability, readability, and maintainability, and that code has to push large data structures such as big data frames through a series of transformations/operations, pass-by-value semantics lead to a lot of data copying and heap thrashing (a bad thing). For example, a data frame that takes 50 MB on the heap and is passed as a function parameter will be copied at least as many times as the function call depth N, so the heap size at the bottom of the call stack will be N*50 MB. If the functions return a transformed/modified data frame from deep in the call chain, the copying goes up by another N.

The SO question "What is the best way to avoid passing a data frame around?" touches on this topic, but it is phrased in a way that avoids directly asking the pass-by-reference question, and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics, and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, its "magic wrapper" is passed by value, but to the R developer the semantics are pass-by-reference.
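
To make the environment point concrete, here is a minimal sketch in base R. The names `store`, `plain`, `add_total`, and `add_total_value` are made up for illustration. It only shows that a change made through an environment is visible to the caller; R may still copy the data frame internally when the column is added.

```r
# Hypothetical helper: adds a column to the data frame stored in an environment.
add_total <- function(env) {
  env$df$total <- env$df$a + env$df$b  # rebinds df inside the shared environment
  invisible(NULL)
}

# Same operation on a plain data frame argument: only the local copy changes.
add_total_value <- function(df) {
  df$total <- df$a + df$b
  invisible(NULL)
}

store <- new.env()
store$df <- data.frame(a = 1:3, b = 4:6)
plain <- data.frame(a = 1:3, b = 4:6)

add_total(store)        # the caller sees the new column
add_total_value(plain)  # the caller's data frame is unchanged

names(store$df)
names(plain)
```

The function receives the same environment the caller holds, so nothing has to be returned and reassigned.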

It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.

If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
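
As a rough illustration of the kind of wrapper meant here, the sketch below uses a plain environment and S3 methods instead of proto. The class name `ref_df` and the helper `double_a` are hypothetical, and only `$` and `[[` are covered; functions such as `nrow()` or `print()` would need methods of their own.

```r
# Hypothetical constructor: keeps the data frame in a private environment.
ref_df <- function(df) {
  env <- new.env()
  env$df <- df
  structure(list(env = env), class = "ref_df")
}

# Read access is forwarded to the wrapped data frame.
# .subset2() fetches the environment without dispatching on the class again.
`$.ref_df`  <- function(x, name) .subset2(x, "env")$df[[name]]
`[[.ref_df` <- function(x, i)    .subset2(x, "env")$df[[i]]

# Write access changes the data frame inside the shared environment.
`$<-.ref_df` <- function(x, name, value) {
  env <- .subset2(x, "env")
  env$df[[name]] <- value
  x
}
`[[<-.ref_df` <- function(x, i, value) {
  env <- .subset2(x, "env")
  env$df[[i]] <- value
  x
}

# A function that modifies its argument and returns nothing.
double_a <- function(r) {
  r$a <- r$a * 2
  invisible(NULL)
}

r <- ref_df(data.frame(a = 1:3))
double_a(r)
r$a  # reflects the change made inside double_a()
```

The wrapper itself is a small list that is still passed by value; only the environment inside it is shared, which is what gives the reference behaviour.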

Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.

Recommended answer

The premise of the question is (partly) incorrect. R works as pass-by-promise, and repeated copying in the manner you outline happens only when further assignments and alterations to the data frame are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather N*size where N is the number of levels at which assignments are made. You are correct, however, that environments can be useful. I see on following the link that you have already found the 'proto' package. There is also a relatively recent introduction of a "reference class", sometimes referred to as "R5", by analogy with S3, the original class system that R copied from S, and S4, the more recent class system that seems to mostly support Bioconductor package development.
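
Both points can be sketched in a few lines of R. All names below (`read_only`, `modify`, `Frame`, `scale_x`, `scale_twice`) are invented for illustration. `tracemem()` only works when R was built with memory profiling, so it is guarded here; where it is available, it should report a duplication during `modify()` but not during `read_only()`.

```r
library(methods)  # for setRefClass(); attached by default in recent R versions

df <- data.frame(x = 1:5)

# Part 1: an argument is copied only when the function modifies it.
read_only <- function(d) nrow(d)
modify <- function(d) {
  d$x <- d$x * 2L   # the modification is what triggers a copy
  d
}

can_trace <- capabilities("profmem")
if (can_trace) tracemem(df)
n_rows  <- read_only(df)
doubled <- modify(df)
if (can_trace) untracemem(df)

# Part 2: a reference class holding a data frame.
Frame <- setRefClass("Frame",
  fields  = list(df = "data.frame"),
  methods = list(
    scale_x = function(k) {
      df$x <<- df$x * k   # <<- updates the field on the object itself
      invisible(.self)
    }
  )
)

# A function that changes the object it is given and returns nothing.
scale_twice <- function(obj) {
  obj$scale_x(2L)
  obj$scale_x(2L)
  invisible(NULL)
}

f <- Frame$new(df = df)
scale_twice(f)
f$df$x  # changed by scale_twice()
df$x    # the original data frame is untouched
```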

Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:

https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
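
The code below is not taken from the linked post. It is only a minimal sketch of the general idea, with a hypothetical class `FrameBox`, constructor `frame_box`, and generic `add_column`.

```r
library(methods)

# An S4 class whose only slot is an environment holding the data frame.
setClass("FrameBox", representation(store = "environment"))

# Hypothetical constructor.
frame_box <- function(df) {
  store <- new.env()
  store$df <- df
  new("FrameBox", store = store)
}

setGeneric("add_column", function(box, name, value) standardGeneric("add_column"))
setMethod("add_column", "FrameBox", function(box, name, value) {
  store <- box@store          # copies the reference, not the data
  store$df[[name]] <- value   # modifies the data frame held in the environment
  invisible(box)
})

# A function that changes the object it is given and returns nothing.
tag_rows <- function(b) {
  add_column(b, "tag", c("x", "y", "z"))
  invisible(NULL)
}

box <- frame_box(data.frame(a = 1:3))
tag_rows(box)
names(box@store$df)  # includes the column added inside tag_rows()
```

The S4 object is still passed by value, but copying it is cheap because the slot only holds a reference to the environment.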

Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using "[" are different from those of regular R data.frames, and which really does work as pass-by-reference. It has superior speed of access and processing. It can also fall back on data frame semantics, because such objects now inherit from the 'data.frame' class.
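
A short sketch of that behaviour, assuming the data.table package is installed (the guard skips the example otherwise). The helper `add_flag` is made up for illustration; `:=`, `set()`, and `copy()` are data.table functions.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  dt <- data.table(x = 1:5)

  # Hypothetical helper: adds a column in place and returns nothing.
  add_flag <- function(d) {
    d[, flag := x > 2L]   # := modifies d by reference
    invisible(NULL)
  }

  add_flag(dt)
  names(dt)               # the caller sees the new column

  # copy() gives an independent object when by-reference updates are not wanted.
  dt_copy <- copy(dt)
  set(dt_copy, j = "x", value = 0L)
  dt$x                    # unchanged by the update to dt_copy
}
```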

You may also want to look into Hesterberg's dataframe package.
