How do I find the average in a LARGE set of numbers?


Question

I have a large set of numbers, probably in the multi-gigabyte range. The first issue is that I can't store all of them in memory. The second is that any attempt to add them up results in an overflow. I was thinking of using something like a rolling average, but it needs to be accurate. Any ideas?

These are all floating-point numbers.

The data is not read from a database; it is a CSV file collected from multiple sources. It has to be accurate because the values are stored as fractions of a second (e.g., 0.293482888929), and a rolling average can be the difference between 0.2 and 0.3.

It is a set of numbers representing how long users took to respond to certain form actions. For example, when a message box is shown, how long did it take them to press OK or Cancel? The data was sent to me stored as seconds with a fractional part, 1.2347 seconds for example. When I convert it to milliseconds, I overflow int, long, etc. rather quickly. Even if I don't convert it, I still overflow rather quickly. I guess the one answer below is correct: maybe I don't have to be 100% accurate, and if I just look within a certain range inside a specific standard deviation I would be close enough.
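For reference, the overflow described above comes from accumulating one ever-growing total. A minimal sketch of an alternative, assuming the values can be streamed one row at a time, is an incremental mean that only ever stores the current average and a count. It is written in Python for brevity; the function names and the one-value-per-row CSV layout are assumptions for illustration, not part of the original question.

```python
import csv
import io


def running_mean(values):
    """Return the mean of an iterable without holding it in memory.

    Uses the incremental update mean += (x - mean) / count, so no large
    running total is ever stored. Returns 0.0 for an empty input.
    """
    mean = 0.0
    count = 0
    for x in values:
        count += 1
        mean += (x - mean) / count
    return mean


def read_column(lines, column=0):
    """Yield one float per non-empty CSV row.

    `column` is a hypothetical layout choice: it assumes the timing
    value sits in that column of every row.
    """
    for row in csv.reader(lines):
        if row:
            yield float(row[column])


# Stand-in for a large file; with real data, pass an open file object.
sample_csv = io.StringIO("1.2347\n0.293482888929\n2.5\n0.75\n")
print(running_mean(read_column(sample_csv)))
```

This reads every value once, so it trades the speed of sampling for an exact single-pass result, up to floating-point rounding.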

Recommended answer

You can sample randomly from your set (the "population") to estimate the average (the "mean"). The accuracy depends on how much your samples vary (as measured by the "standard deviation" or variance) and on how many samples you take.

The advantage is that you have billions of observations, and you only have to sample a fraction of them to get decent accuracy, or whatever "confidence interval" you choose. If the conditions are right, this cuts down the amount of work you will be doing.
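To make the trade-off between sample size and accuracy concrete, here is a rough Python sketch. It assumes the sample is random and large enough for the normal approximation to hold, and it uses z = 1.96 for a roughly 95% interval. Both helper names are hypothetical.

```python
import math
import statistics


def mean_with_confidence(sample, z=1.96):
    """Return (sample mean, half-width of the confidence interval).

    The interval is mean +/- z * s / sqrt(n), where s is the sample
    standard deviation. z=1.96 corresponds to roughly 95% confidence
    under the normal approximation (an assumption, not a guarantee).
    """
    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)
    half_width = z * stdev / math.sqrt(len(sample))
    return mean, half_width


def required_sample_size(stdev, margin, z=1.96):
    """Estimate how many samples give a half-width of at most `margin`.

    Solves z * stdev / sqrt(n) <= margin for n and rounds up.
    """
    return math.ceil((z * stdev / margin) ** 2)


mean, half_width = mean_with_confidence([1.2, 0.9, 1.5, 1.1, 1.3, 0.8])
print(f"mean = {mean:.4f} +/- {half_width:.4f}")
print(required_sample_size(stdev=1.0, margin=0.01))
```

The required sample size depends on the spread of the data and the desired margin, not on how many billions of observations exist, which is why sampling can save so much work. Response times are often skewed, so treat the interval as approximate.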

Here's a numerical library for C# that includes a random sequence generator. Just generate a random sequence of numbers that serve as indices into your array of elements (from 1 to x, where x is the number of elements in your array). Dereference the indices to get the values, and then calculate the mean and standard deviation of that sample.
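The index-sampling steps above can be sketched as follows. This is an illustration in Python using only the standard library rather than the C# library the answer refers to, and it assumes the values are addressable by index (for example an array or a memory-mapped file). Note that Python indices run from 0 to x - 1 rather than 1 to x. The function name and the `seed` parameter are assumptions added for reproducibility.

```python
import random
import statistics


def sample_mean_and_stdev(values, sample_size, seed=None):
    """Estimate mean and standard deviation from a random sample.

    Picks `sample_size` distinct random indices into `values`,
    dereferences them, and computes statistics on the picked values.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(values)), sample_size)
    picked = [values[i] for i in indices]
    return statistics.fmean(picked), statistics.stdev(picked)


# Synthetic stand-in for the real data: response times in seconds.
data_rng = random.Random(0)
population = [data_rng.uniform(0.2, 3.0) for _ in range(100_000)]

est_mean, est_stdev = sample_mean_and_stdev(population, 1_000, seed=1)
print(est_mean, est_stdev)
```

Sampling without replacement, as `random.Random.sample` does, avoids counting the same observation twice.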

If you want to test the distribution of your data, consider using the chi-squared goodness-of-fit test or the Kolmogorov-Smirnov (K-S) test, which you'll find in many spreadsheet and statistical packages (e.g., R). That will help confirm whether this approach is usable.
