How to efficiently calculate a running standard deviation?

Question

I have an array of lists of numbers, e.g.:

[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
     ...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)

What I would like to do is efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).

To do the standard deviation, I loop through again, now that I have the mean calculated.

I would like to avoid going through the array twice, once for the mean and then once for the SD (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g. Perl or Python) or pseudocode is fine.

Recommended answer

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

It's more numerically stable than either the two-pass approach or the online simple sum-of-squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other, because they lead to what is known as "catastrophic cancellation" in the floating-point literature.
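As a minimal sketch of how this could look for the data layout in the question, the Python below applies Welford's update to every index of each list in a single pass. The function name `running_mean_std` and the sample `rows` are illustrative assumptions, and it divides by n because the question states it is working with a population.

```python
import math

def running_mean_std(rows):
    """Single pass over rows; returns (means, population std devs) per index.

    Hypothetical helper name; assumes every row has the same length.
    """
    n = 0
    means = []
    m2 = []  # running sum of squared deviations from the mean, per index
    for row in rows:
        if n == 0:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        n += 1
        for i, x in enumerate(row):
            delta = x - means[i]            # deviation from the old mean
            means[i] += delta / n           # update the mean
            m2[i] += delta * (x - means[i])  # uses old and new mean
    if n == 0:
        return [], []
    # Population standard deviation: divide by n, not n - 1.
    stds = [math.sqrt(s / n) for s in m2]
    return means, stds

# Sample data shaped like the lists in the question.
rows = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
]
means, stds = running_mean_std(rows)
for i, (m, s) in enumerate(zip(means, stds)):
    print(f"index {i}: mean={m:.4f} sd={s:.4f}")
```

Only the count, the current means, and the `m2` accumulators are kept between rows, so the rows can come from any iterable, including one that is too large to hold in memory.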

You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
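To make the N versus N-1 distinction concrete, here is a small sketch using Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by N-1. The sample values in `data` are made up for illustration.

```python
import statistics

# Illustrative values: mean is 5, squared deviations sum to 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

population_variance = statistics.pvariance(data)  # 32 / 8
sample_variance = statistics.variance(data)       # 32 / 7

print(population_variance, sample_variance)

# The two differ only by the factor N / (N - 1).
assert abs(sample_variance - population_variance * n / (n - 1)) < 1e-12
```

With Welford's algorithm the same choice comes down to the final division: the accumulated sum of squared deviations is divided by N for a population or by N-1 for a sample.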

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:
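Independently of those entries, one way to sketch deletion is to run the Welford update in reverse. The class below is an assumption written for illustration, not code taken from the blog posts; the name `RunningStats` and its methods are hypothetical.

```python
class RunningStats:
    """Hypothetical accumulator: Welford update plus its algebraic inverse."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        """Undo a previous add(x); x must be a value that was added."""
        if self.n == 0:
            raise ValueError("no values to remove")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        old_mean = self.mean
        self.n -= 1
        self.mean = old_mean - (x - old_mean) / self.n
        self.m2 -= (x - old_mean) * (x - self.mean)
        # Guard against a tiny negative value caused by rounding.
        self.m2 = max(self.m2, 0.0)

    def variance(self):
        """Population variance (divide by n)."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in (1.0, 2.0, 3.0, 4.0, 100.0):
    stats.add(value)
stats.remove(100.0)  # back to the statistics of 1, 2, 3, 4
print(stats.mean, stats.variance())
```

Note that removal subtracts from the accumulator, so it does not share the stability guarantees of the add-only update; after many removals the rounding error can grow, and recomputing from scratch occasionally may be worthwhile.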

You can also take a look at my Java implementation; the javadoc, source, and unit tests are all online:
