How to efficiently calculate a running standard deviation?

Question

I have an array of lists of numbers, e.g.:

[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
     ...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)

What I would like to do is efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To compute the mean, I have been looping through the array and summing the values at each index of the lists. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).

To compute the standard deviation, I loop through the array again, now that I have the mean.

I would like to avoid going through the array twice, once for the mean and then once for the SD (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g. Perl or Python) or pseudocode is fine.

Recommended answer

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other, as they lead to what is known as "catastrophic cancellation" in the floating-point literature.
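
As a rough illustration, Welford's update can be applied independently at each list index so that the mean and standard deviation come out of a single pass over the array. The function name `running_mean_std` and the variable names are made up for this sketch, and it divides by n because the question says the data is a population.

```python
import math

def running_mean_std(rows):
    """One pass over rows; returns (means, population std devs) per index.

    Hypothetical helper for illustration; assumes all rows have equal length.
    """
    n = 0
    means = []
    m2 = []  # per-index running sum of squared deviations from the current mean
    for row in rows:
        if n == 0:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        n += 1
        for i, x in enumerate(row):
            delta = x - means[i]             # deviation from the old mean
            means[i] += delta / n            # update the mean
            m2[i] += delta * (x - means[i])  # uses old and new mean
    if n == 0:
        return [], []
    # Divide by n (population), as stated in the question.
    stds = [math.sqrt(s / n) for s in m2]
    return means, stds

rows = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
means, stds = running_mean_std(rows)
```

Only the count, the running means, and the running sums of squared deviations are kept, so the rows can be streamed without storing them.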

You might also want to brush up on the difference between dividing by the number of samples (N) and by N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
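
To make the N versus N-1 distinction concrete, here is a small sketch using Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by N-1. The four values are arbitrary and chosen only for illustration.

```python
import statistics

# Arbitrary illustrative values, not taken from any real data set.
sample = [0.01, 0.00, 0.01, 0.01]
n = len(sample)

pop_var = statistics.pvariance(sample)  # sum of squared deviations / n
samp_var = statistics.variance(sample)  # sum of squared deviations / (n - 1)

# The two differ only by the factor n / (n - 1).
rescaled = pop_var * n / (n - 1)
```

The N-1 version is always the larger of the two, and the gap shrinks as N grows.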

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:

You can also take a look at my Java implementation; the Javadoc, source, and unit tests are all online:

  • Javadoc: stats.OnlineNormalEstimator
  • Source: stats.OnlineNormalEstimator.java
  • JUnit Source: test.unit.stats.OnlineNormalEstimatorTest.java
  • LingPipe Home Page
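
On deleting previous values online, one possible approach is to algebraically invert Welford's update. The sketch below is not taken from the blog entries or from the Java class listed above. The class name `RunningStats` and its methods are hypothetical, and it assumes that only values which were previously added are ever removed.

```python
class RunningStats:
    """Hypothetical accumulator: Welford add plus an inverse 'remove' step."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        # Assumes x was added earlier; this is not checked.
        if self.n == 0:
            raise ValueError("no values to remove")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        mean_with_x = self.mean
        self.n -= 1
        # Mean of the remaining values.
        self.mean = mean_with_x - (x - mean_with_x) / self.n
        # Undo the term that add() contributed for x.
        self.m2 -= (x - self.mean) * (x - mean_with_x)
        self.m2 = max(self.m2, 0.0)  # clamp tiny negatives from rounding

    def variance(self):
        """Population variance (divides by n)."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for value in (1.0, 2.0, 3.0, 4.0):
    stats.add(value)
stats.remove(1.0)  # remaining values: 2.0, 3.0, 4.0
```

Repeated removals can accumulate rounding error, so a long-running sliding window may need an occasional recomputation from scratch.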
