How to calculate the standard deviation from a histogram? (Python, Matplotlib)


Question

Let's say I have a data set and have used matplotlib to draw a histogram of it:

n, bins, patches = plt.hist(data, density=True)  # density=True replaces normed=1, which newer Matplotlib releases removed

How do I calculate the standard deviation using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:

s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / np.sum(n)

This seems to work fine, as I get fairly accurate results. However, if I try to calculate the standard deviation like this:

t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / np.sum(n))

my results are way off from what numpy.std(data) returns. Replacing the left bin edges with the center point of each bin doesn't change this either. I suspect the problem is that the n and bins values don't contain any information about how the individual data points are distributed within each bin, but the assignment I'm working on clearly requires that I use them to calculate the standard deviation.

Recommended answer

You haven't weighted the contribution of each bin with n[i]. Change the increment of t to

    t += n[i]*(bins[i] - mean)**2
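For reference, here is a minimal sketch of the whole corrected loop in one piece. The synthetic data, the seed and the use of np.histogram in place of plt.hist are assumptions made for illustration, not part of the original question. It also uses the bin midpoints rather than the left edges, to match how the mean was computed.

```python
import numpy as np

# Illustrative data only (assumption: not taken from the original question).
rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=1000)
n, bins = np.histogram(data)

# Weighted mean of the bin midpoints.
s = 0.0
for i in range(len(n)):
    s += n[i] * (bins[i] + bins[i + 1]) / 2
mean = s / np.sum(n)

# Weighted sum of squared deviations: each bin contributes n[i] times.
t = 0.0
for i in range(len(n)):
    mid = (bins[i] + bins[i + 1]) / 2
    t += n[i] * (mid - mean) ** 2
std = np.sqrt(t / np.sum(n))

print(mean, std, np.std(data))
```

The histogram-based estimate should land close to np.std(data), though not exactly on it, because every point in a bin is treated as if it sat at the bin's midpoint.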

By the way, you can simplify (and speed up) the calculation by using numpy.average with the weights argument.

Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.

In [54]: x = np.random.normal(loc=10, scale=2, size=1000)

In [55]: x.mean()
Out[55]: 9.9760798903061847

In [56]: x.var()
Out[56]: 3.7673459904902025

In [57]: x.std()
Out[57]: 1.9409652213499866

I'll use numpy.histogram to compute the histogram:

In [58]: n, bins = np.histogram(x)

mids holds the midpoints of the bins; it has the same length as n:

In [59]: mids = 0.5*(bins[1:] + bins[:-1])

The estimate of the mean is the weighted average of mids:

In [60]: mean = np.average(mids, weights=n)

In [61]: mean
Out[61]: 9.9763028267760312

In this case, it is very close to the mean of the original data.

The estimated variance is the weighted average of the squared differences from the mean:

In [62]: var = np.average((mids - mean)**2, weights=n)

In [63]: var
Out[63]: 3.8715035807387328

In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677

That estimate is within 2% of the actual sample standard deviation.
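The steps above can be wrapped in a small helper. This is only a sketch: the function name hist_mean_std and its density flag are hypothetical, introduced here for illustration. The flag covers the case where the histogram was normalized (as in the question's plt.hist call), in which the bar heights must be multiplied by the bin widths to recover each bin's share of the data.

```python
import numpy as np

def hist_mean_std(n, bins, density=False):
    """Estimate mean and standard deviation from histogram heights and bin edges.

    hist_mean_std is a hypothetical helper name, not a NumPy or Matplotlib function.
    """
    n = np.asarray(n, dtype=float)
    bins = np.asarray(bins, dtype=float)
    mids = 0.5 * (bins[1:] + bins[:-1])
    # Normalized heights are per unit width; multiply by the bin widths
    # to get each bin's share of the data. Raw counts are used as-is.
    weights = n * np.diff(bins) if density else n
    mean = np.average(mids, weights=weights)
    var = np.average((mids - mean) ** 2, weights=weights)
    return mean, np.sqrt(var)

# Usage with illustrative data (the seed and sample are assumptions).
rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=1000)
counts, edges = np.histogram(x, bins=20)
heights, _ = np.histogram(x, bins=20, density=True)
print(hist_mean_std(counts, edges))
print(hist_mean_std(heights, edges, density=True))
```

Both calls should print the same pair of values, since normalizing the histogram only rescales the weights.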
