R: selecting subset without copying

Question

Is there a way to select a subset from objects (data frames, matrices, vectors) without making a copy of the selected data?

I work with quite large data sets but never change them. For convenience, however, I often select subsets of the data to operate on. Making a copy of a large subset each time is very memory inefficient, but both normal indexing and subset() (and thus the xapply() family of functions) create copies of the selected data. So I'm looking for functions or data structures that can overcome this issue.

Some possible approaches that may fit my needs and are hopefully implemented in some R packages:

  • a copy-on-write mechanism, i.e. data structures that are copied only when you add or rewrite existing elements;
  • immutable data structures that only require recreating the indexing information for the data structure, not its content (like making a substring from a string by creating only a small object that holds a length and a pointer into the same char array);
  • xapply() analogues that do not create subsets.
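
To make the second idea concrete, here is a minimal base R sketch of a "view" that stores only row indices plus a reference to the parent data. The names make_view and view_col_apply are hypothetical helpers invented for this illustration, not functions from any package. Note that the selected elements of one column are still materialised temporarily when the function is applied; the sketch only avoids building a full subset object up front.

```r
# Hypothetical helper: a "view" keeps a reference to the parent data and the
# selected row indices, not the selected rows themselves.
make_view <- function(data, rows) {
  store <- new.env()
  store$data <- data   # binds the existing object; the data are not duplicated here
  list(store = store, rows = rows)
}

# Hypothetical helper: apply f to one column of the view. Only the selected
# elements of that single column are extracted, for the duration of the call.
view_col_apply <- function(view, col, f) {
  f(view$store$data[[col]][view$rows])
}

big <- data.frame(id = 1:10, value = (1:10) * 1.5)
v <- make_view(big, rows = which(big$id > 7))   # stores three integers, not three rows
total <- view_col_apply(v, "value", sum)
```

The view stays valid as long as the parent data are treated as read-only, which matches the "never change them" constraint in the question.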

Recommended answer

Try package ref. Specifically, its refdata class.

What you might be missing about data.table is that when grouping (the by= parameter) the subsets of data are not copied, so that's fast. [Well, technically they are, but into a shared area of memory that is reused for each group, and they are copied using memcpy, which is much faster than R's for loops in C.]
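
As a small illustration of grouping with by=, assuming the data.table package is installed (the table and column names are made up for the example):

```r
library(data.table)

# Toy table; g is the grouping column, x the values to summarise.
DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 x = c(1, 2, 3, 4, 5))

# One call computes per-group summaries; no list of per-group
# data frames is built by the user, unlike split() + lapply().
means <- DT[, .(mean_x = mean(x), n = .N), by = g]
```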

:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copied on write. The user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
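
A short sketch of that behaviour, again assuming data.table is installed (the object names are illustrative only):

```r
library(data.table)

DT <- data.table(id = 1:3, x = c(10, 20, 30))

alias_dt <- DT        # second name for the same table; nothing is copied
snapshot <- copy(DT)  # explicit deep copy, independent of DT

DT[, y := x * 2]      # adds column y by reference, in place

has_y_alias    <- "y" %in% names(alias_dt)   # the alias sees the new column
has_y_snapshot <- "y" %in% names(snapshot)   # the explicit copy does not
```

This is why copy() matters inside functions: without it, a := on the argument also changes the caller's table.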

You're right that there isn't a mechanism like refdata built into data.table. I see what you mean, and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with a data.frame (but be sure to monitor copies with tracemem(DF)).
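
A minimal sketch of monitoring copies with tracemem() in base R. It assumes R was built with memory profiling, which the code checks with capabilities("profmem"); the data frame is a toy example.

```r
DF <- data.frame(id = 1:5, value = c(2.5, 3.5, 1.0, 4.0, 0.5))

tracing <- capabilities("profmem")   # tracemem() needs memory profiling support
if (tracing) tracemem(DF)            # from here on, duplications of DF are reported

alias <- DF                 # plain assignment: no duplication is reported
alias$value[1] <- 0         # modifying the alias forces a copy; DF itself is unchanged

if (tracing) untracemem(DF)

picked <- DF[DF$value > 2, ]   # ordinary subsetting builds a new data frame
```

Any "tracemem[...]" line printed while the code runs marks a point where R duplicated the traced object.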

There is also idata.frame (immutable data.frame) in package plyr that you could try.
