How to calculate the standard deviation from a histogram? (Python, Matplotlib)
Question
Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, density=True)  # density=True replaces normed=1, which newer Matplotlib versions no longer accept
How do I calculate the standard deviation using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = numpy.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
Recommended answer
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
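Applied to the loops from the question, the weighted version might look like the sketch below. It is only an illustration: the sample data and seed are made up, n and bins come from numpy.histogram rather than plt.hist, and it uses the bin midpoints instead of the left edges so that the deviations are measured from the same points that were used to compute the mean.

```python
import numpy as np

# Illustrative data only; the seed and distribution parameters are assumptions.
rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=1000)
n, bins = np.histogram(data)

# Weighted mean of the bin midpoints, as in the question.
s = 0.0
for i in range(len(n)):
    s += n[i] * (bins[i] + bins[i + 1]) / 2
mean = s / np.sum(n)

# Weighted sum of squared deviations: bin i contributes n[i] times.
t = 0.0
for i in range(len(n)):
    mid = (bins[i] + bins[i + 1]) / 2
    t += n[i] * (mid - mean) ** 2
std = np.sqrt(t / np.sum(n))

print(mean, std)                # estimates from the histogram
print(data.mean(), data.std())  # values from the raw data, for comparison
```

The two printed pairs should be close but not identical, since the histogram discards the positions of the points inside each bin.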
The same calculation can be done with numpy.average and its weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids contains the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
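For reuse, the steps above can be collected into a small helper. This is a sketch, not part of the original answer: the name hist_mean_std is hypothetical, the data and seed are made up, and it assumes equal-width bins. Under that assumption it should not matter whether n holds raw counts or the normalized heights that hist() returns when asked for a density, because numpy.average normalizes the weights itself.

```python
import numpy as np

def hist_mean_std(n, bins):
    """Estimate mean and standard deviation from histogram heights and bin edges.

    Hypothetical helper; assumes equal-width bins.
    """
    n = np.asarray(n, dtype=float)
    bins = np.asarray(bins, dtype=float)
    mids = 0.5 * (bins[1:] + bins[:-1])               # bin midpoints
    mean = np.average(mids, weights=n)                # weighted mean
    var = np.average((mids - mean) ** 2, weights=n)   # weighted variance
    return mean, np.sqrt(var)

# Illustrative data only; the seed and distribution parameters are assumptions.
rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=1000)

counts, edges = np.histogram(x)              # raw counts per bin
density, _ = np.histogram(x, density=True)   # normalized heights, same bins

mean_c, std_c = hist_mean_std(counts, edges)
mean_d, std_d = hist_mean_std(density, edges)

print(mean_c, std_c)      # from counts
print(mean_d, std_d)      # from density heights
print(x.mean(), x.std())  # from the raw data
```

With bins of unequal width, the density heights would need to be multiplied by the bin widths before being used as weights.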