Can R do operations like cumsum in-place?


Question


In Python I can do this:

import numpy as np

a = np.arange(100)
print(id(a))          # shows some number
a[:] = np.cumsum(a)   # write the cumulative sum back into a's existing buffer
print(id(a))          # shows the same number

What I did here was to replace the contents of a with its cumsum. The address before and after is the same.

Now let's try it in R:

install.packages('pryr')
library(pryr)
a = 0:99
print(address(a)) # shows some number
a[1:length(a)] = cumsum(a)
print(address(a)) # shows a different number!

The question is how can I overwrite already-allocated memory in R with the results of computations? The lack of this sort of thing seems to be causing significant performance discrepancies when I do vector operations in R vs. Rcpp (writing code in C++ and calling it from R, which lets me avoid unnecessary allocations).

I'm using R 3.1.1 on Ubuntu Linux 10.04 with 24 physical cores and 128 GB of RAM.
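
For reference, here is a minimal sketch of the Rcpp route alluded to above, assuming the Rcpp package is installed (the name cumsum_in_place is my own, not an existing function). The C++ kernel writes through Rcpp's NumericVector proxy straight into R's own memory, so nothing new is allocated on the R side:

library(Rcpp)

# NumericVector is a thin proxy over R's memory: writing through it
# mutates the original vector, provided no coercion copy was made.
cppFunction('
  void cumsum_in_place(NumericVector x) {
    for (int i = 1; i < x.size(); ++i)
      x[i] += x[i - 1];
  }
')

a = as.numeric(0:99)   # already a double vector, so Rcpp will not coerce-copy it
pryr::address(a)       # some address
cumsum_in_place(a)     # overwrites a where it sits
pryr::address(a)       # same address

Note that mutating an argument this way sidesteps R's copy-on-modify contract, so it is only safe when no other binding shares the vector.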

Solution

I did this

> x = 1:5
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,3,6,10,15

where the @3acfed60 is the (shared) memory address. The key is NAM(1), which says that there's a single reference to x, hence no need to re-allocate on update.

R currently uses (I think this will change in the next release) a limited form of reference counting, in which an object is marked as referenced 0, 1, or more than 1 times; once an object is referenced more than once the count can never be decremented, because 'more than one' could mean 2 or 3, so there is no way to tell whether dropping a reference would leave 1 or 2 behind. Any attempt to modify such an object therefore has to duplicate it first.
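
You can watch this duplication from plain R with base R's tracemem(), which reports whenever a traced vector is copied. A small sketch of my own (run it in a vanilla R session, since IDEs can hold extra references that force copies):

x = runif(5)
tracemem(x)   # start reporting whenever x is duplicated
x[1] = 0      # no copy reported: x has a single reference, modified in place
y = x         # a second reference; the count can no longer be decremented
x[1] = 1      # tracemem now reports a duplication before the write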

Originally I forgot to load pryr and wrote my own address()

> address = function(x) .Internal(inspect(x))

which reveals an interesting problem

> x = 1:5
> address(x)
@4647128 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> address(x)
@4647098 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,3,6,10,15

Notice NAM(2), which says that inside the function there are at least two references to x: one in the global environment and one in the function environment. So merely touching x inside a function triggers duplication on a future modification, a sort of Heisenberg uncertainty principle. cumsum (and .Internal, and length) are written in a way that allows a reference without incrementing NAMED; address() should be revised to have similar behavior (this has now been fixed).

Hmm, when I dig a little deeper I see (I guess it's obvious, in retrospect) that what actually happens is that cumsum(x) does allocate memory via an S-expression

> x = 1:5
> .Internal(inspect(x))
@3bb1cd0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> .Internal(inspect(cumsum(x)))
@43919d0 13 INTSXP g0c3 [] (len=5, tl=0) 1,3,6,10,15

but the replacement assignment x[] <- then copies that result into the old location (??). (This seems to be 'as efficient' as data.table, which apparently also creates an S-expression for cumsum, presumably because it too calls cumsum itself!) So mostly I've not been helpful in this answer...
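
To see the two assignment forms side by side, a sketch of my own (addresses will differ from run to run, and x must hold a single reference):

x = 1:5
.Internal(inspect(x))   # note the address and NAM(1)
x[] = cumsum(x)         # replacement assignment: values copied into x's existing buffer
.Internal(inspect(x))   # same address
x = cumsum(x)           # ordinary assignment: x is rebound to the fresh allocation
.Internal(inspect(x))   # a different address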

The allocation itself is unlikely to be the cause of performance problems; more likely it is garbage collection of the no-longer-used memory (set gcinfo(TRUE) to see collections as they occur). I find it useful to launch R with

R --no-save --quiet --min-vsize=2048M --min-nsize=45M

which starts with a larger memory pool and hence fewer (initial) garbage collections. It would be worth analyzing your coding style to understand why this shows up as the performance bottleneck.
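
For instance, a minimal way of my own to watch those collections (the sizes are arbitrary; each cumsum call below allocates a fresh result of about 80 MB that immediately becomes garbage):

gcinfo(TRUE)                    # print a summary at each garbage collection
x = numeric(1e7)
for (i in 1:20) x = cumsum(x)   # repeated full-size allocations trigger collections
gcinfo(FALSE)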
