Statistics with large amount of data in C++ or Scilab or Octave or R
Problem description
I recently needed to calculate the mean and standard deviation of a large number (about 800,000,000) of doubles. Considering that a double takes 8 bytes, if all the doubles are read into RAM, it will take about 6 GB. I think I could use a divide-and-conquer approach in C++ or another language, but that seems tedious. Is there a way I can do this all at once with a high-level language like R, Scilab or Octave? Thanks.
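For reference, the divide-and-conquer idea mentioned above is not much code: process the data in chunks and accumulate the count, the sum, and the sum of squares, then combine at the end. A minimal sketch in Python (the `chunks` iterable is a stand-in for however the data is actually read, e.g. from a binary file):

```python
import math

def mean_std(chunks):
    """Compute mean and population std from an iterable of chunks,
    accumulating count n, sum s and sum of squares ss."""
    n = 0
    s = 0.0
    ss = 0.0
    for chunk in chunks:
        n += len(chunk)
        s += sum(chunk)
        ss += sum(x * x for x in chunk)
    mean = s / n
    # population variance = E[x^2] - (E[x])^2
    return mean, math.sqrt(ss / n - mean * mean)

# toy example with two small chunks
m, sd = mean_std([[1.0, 2.0], [3.0, 4.0]])
# m = 2.5, sd = sqrt(1.25) ≈ 1.118
```

Only the running totals ever live in memory, so the per-chunk size can be tuned to whatever fits in RAM.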
Answer
Not claiming that this is optimal, but in Python (with the numpy and numexpr modules) the following is easy on an 8 GB RAM machine:
import numpy as np
import numexpr

x = np.random.uniform(0, 1, size=800_000_000)
std = (numexpr.evaluate('sum(x*x)') / len(x)
       - (numexpr.evaluate('sum(x)') / len(x)) ** 2) ** 0.5
print(x.mean(), std)
# 0.499991593345 0.288682001731
This doesn't consume more memory than the original array.
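One caveat worth noting: the `sum(x*x)/n - mean**2` formula above can lose precision when the variance is small relative to the mean. Welford's one-pass algorithm is numerically stable and works on a stream, so the data never needs to be in memory at all. A minimal sketch (the `stream` argument would be whatever iterator yields the doubles):

```python
def welford(stream):
    """One-pass Welford update: numerically stable mean and population std."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / n) ** 0.5

m, sd = welford([1.0, 2.0, 3.0, 4.0])
# m = 2.5, sd ≈ 1.118
```

For 800 million values the pure-Python loop would be slow; in practice one would apply the same update chunk-wise with numpy, but the recurrence is the same.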