Calculating the mean and standard deviation of data that does not fit in memory using Python


Question

I have a lot of data stored on disk in large arrays. I can't load all of it into memory at once.

How can I calculate the mean and the standard deviation?

Recommended answer

There is a simple online algorithm that computes both the mean and the variance by looking at each data point exactly once and using O(1) memory.

Wikipedia provides the following code:

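def online_variance(data):
    n = 0       # number of values seen so far
    mean = 0.0  # running mean
    M2 = 0.0    # running sum of squared deviations from the mean

    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta/n
        M2 = M2 + delta*(x - mean)

    # The sample variance is undefined for fewer than two values.
    if n < 2:
        return float('nan')

    variance = M2/(n - 1)
    return variance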

This algorithm is also known as Welford's method. Unlike the method suggested in the other answer, it can be shown to be numerically stable: each update works with deviations from the running mean, which keeps rounding error small even when the values are large relative to their spread.
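As a rough illustration of the numerical point, the sketch below compares Welford's update with the textbook one-pass formula based on a running sum and a running sum of squares. That formula is chosen here only for contrast; it is an assumption, not necessarily the method the other answer proposed. The names `naive_variance` and `welford_variance` are made up for this example.

```python
def naive_variance(data):
    # Textbook one-pass formula: (sum(x^2) - sum(x)^2 / n) / (n - 1).
    # It subtracts two very large, nearly equal numbers at the end.
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def welford_variance(data):
    # Welford's method: update the mean and the sum of squared
    # deviations from the mean one value at a time.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


# A small spread on top of a large offset.
# The true sample variance of these four values is 30.0.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
print("naive:  ", naive_variance(data))
print("welford:", welford_variance(data))
```

With double-precision floats, the naive result is visibly off for this input, while Welford's update stays on the true value of 30.0.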

Take the square root of the variance to get the standard deviation.
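Putting the pieces together for data that lives on disk, here is a minimal sketch that applies the same update across chunks and returns both the mean and the standard deviation. The names `mean_and_std` and `fake_chunks` are hypothetical, and how the chunks are actually read is left open; slices of a memory-mapped NumPy array would be one option, but that is an assumption about your storage format.

```python
import math


def mean_and_std(chunks):
    """Return (mean, sample standard deviation) over all values in `chunks`.

    `chunks` is any iterable of iterables of numbers, so only one chunk
    has to be held in memory at a time.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for chunk in chunks:
        for x in chunk:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two values for a sample standard deviation")
    return mean, math.sqrt(m2 / (n - 1))


def fake_chunks():
    # Hypothetical stand-in for reading pieces of a large array from disk.
    yield [4.0, 7.0]
    yield [13.0, 16.0]


mean, std = mean_and_std(fake_chunks())
print(mean, std)
```

Because the state is just three numbers (`n`, `mean`, `m2`), the chunk size can be chosen purely for I/O convenience. A pure-Python inner loop is slow for very large arrays, so treat this as a sketch of the logic rather than a tuned implementation.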
