一个非常庞大的数据集的平均值和标准偏差 [英] Mean value and standard deviation of a very huge data set

查看:29
本文介绍了一个非常庞大的数据集的平均值和标准偏差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道是否有一种算法可以计算未绑定数据集的平均值和标准偏差.

I am wondering if there is an algorithm that calculates the mean value and standard deviation of an unbound data set.

例如,我正在监控一个测量值,比如电流.我想要所有历史数据的平均值.每当出现新值时,更新均值和标准差?因为数据太大无法存储,我希望它可以在不存储数据的情况下动态更新均值和标准差.

for example, I am monitoring an measurement value, say, electric current. I would like to have the mean value of all historical data. Whenever a new value come, update the mean and stdev? Because the data is too big to store, I hope it can just update the mean and stdev on the fly without storing the data.

即使存储数据,标准方式 (d1+...+dn)/n 也行不通,总和会导致数据表示溢出.

Even data is stored, the standard way (d1+...+dn)/n, doesn't work, the sum will blow out the data representation.

I 通过大约 sum(d1/n + d2/n + ... d3/n),如果 n 是 hugh,则误差太大且累积.此外,在这种情况下,n 是未绑定的.

I through about sum(d1/n + d2/n + ... d3/n), if n is hugh, the error is too big and accumulated. Besides, n is unbound in this case.

数据的数量肯定是无限制的,每次来都需要更新值.

The number of data is definitely unbound, whenever it comes, it requires to update the value.

有人知道是否有算法吗?

Does anybody know if there is an algorithm for it?

推荐答案

[问题改变了吗?也许我只看了开头.我已更新/编辑以提供更好的回复:]

[did the question change? maybe i only read the start. i have updated/edited to give a better reply:]

我知道没有完美的解决方案(在恒定内存中),但我可以提供各种方法.

there is no perfect solution (in constant memory) that i know of, but i can give various approaches.

首先,对于基本计算,您只需要所有值的总和 (sum_x)、它们的平方和 (sum_x2) 和总计数 (<代码>n).然后:

first, for the basic calculation you only need the sum of all values (sum_x), the sum of their squares (sum_x2), and the total count (n). then:

mean = sum_x / n
stdev = sqrt( sum_x2/n - mean^2 )

并且所有这些值(sum_xsum_x2n)都可以从流中更新.

and all these values (sum_x, sum_x2, n) can be updated from a stream.

问题(如您所说)是处理溢出和/或有限的精度.当 sum_x2 太大以至于内部表示不包括单个平方值的大小值时,如果您考虑浮点数,您就会看到这一点.

the problem (as you say) is dealing with overflow and / or limited precision. you can see this if you consider floating point when sum_x2 is so large that the internal representation doesn't include values of the magnitude of a single squared value.

避免该问题的一个简单方法是使用精确算术,但这会越来越慢(并且还会使用 O(log(n)) 内存).

a simple way to avoid the problem is to use exact arithmetic, but that will be increasingly slow (and also uses O(log(n)) memory).

另一种有用的方法是规范化值 - 如果您知道值通常是 X,那么您可以对 x - X 进行计算,这使得总和更小(显然你然后将 X 添加到平均值中!).这有助于推迟精度丢失的点(并且可以/应该与此处的任何其他方法结合使用 - 例如,在分箱时,您可以使用前一个分箱的平均值).请参阅此算法(knuth 的方法)了解如何逐步执行此操作.

another way that can help is to normalise values - if you know that values are typically X then you can do calculations on x - X which makes the sums smaller (obviously you then add X to the mean!). that helps postpone the point at which precision is lost (and can/should be combined with any other method here - when binning, for example, you can use the mean of the previous bin). see this algorithm (knuth's method) for how to do this progressively.

如果您不介意(小常数因子)O(n) 内存成本,您可以重新启动每个 N 值(例如百万 - 更聪明的仍然是通过检测精度何时过低来调整此值),存储先前的平均值和标准差,然后组合以获得最终结果(因此您的平均值是最近运行总计和旧分箱值的适当加权值).

if you don't mind a (small constant factor) O(n) memory cost you could restart every N values (eg million - smarter still would be to adapt this value by detecting when precision is too low), storing previous mean and stdev and then combine for the final result (so your mean is the appropriately weighted value from the recent running total and the old binned values).

分箱方法可能会推广到多个级别(您开始分箱)并将减少到 O(log(n)) 内存使用,但我还没有弄清楚细节.

the binning approach could probably be generalised to multiple levels (you start binning the bins) and would reduce to O(log(n)) memory use, but i haven't worked out the details.

最后,更实用的解决方案可能是对 1000 个值进行初始方法,然后并行开始新的总和.您可以显示两者的加权平均值,然后再取 1000 个值,去掉旧的总和(逐渐减少它们的权重后)并开始一个新的集合.所以你总是有两组总和,并显示它们之间的加权平均值,这给出了反映最后 1000 个值的连续数据(仅).在某些情况下,这已经足够好了,我想(这不是一个精确的值,因为它仅用于最近"的数据,但它是平滑且具有代表性的,并且使用固定数量的内存).

finally, a more practical solution would probably be to do the initial approach for, say, 1000 values, and then start a new sum in parallel. you could display a weighted average of the two and, after another 1000 values, drop the old sums (after gradually decreasing their weight) and start a new set. so you always have two sets of sums, and display a weighted average between them, which gives continuous data that reflect the last 1000 values (only). in some cases that will be good enough, i imagine (it's not an exact value, since it's only for "recent" data, but it's smooth and representative, and uses a fixed amount of memory).

ps,我后来想到的事情 - 实际上,无论如何,永远"这样做并没有多大意义,因为您将到达值完全由旧数据主导的地步.最好使用运行平均值",这会降低旧值的权重.参见例如 https://en.wikipedia.org/wiki/Moving_average - 但是,我不知道通用等价于 stdev.

ps, something that occurred to me later - really, doing this "forever" doesn't make much sense anyway, because you'll get to a point where the values are absolutely dominated by the old data. it would be better to use a "running mean" which gives decreased weight to old values. see for example https://en.wikipedia.org/wiki/Moving_average - however, i don't know of a common equivalent to stdev.

这篇关于一个非常庞大的数据集的平均值和标准偏差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆