Matplotlib: How to make a histogram with bins of equal area?

Question

Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.

Here is an MWE to display a histogram with normally distributed sample data:

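import matplotlib.pyplot as plt
import numpy as np

x = np.random.randn(100)
# bin_pos is what the question asks for; equal-width edges are only a placeholder
bin_pos = np.linspace(x.min(), x.max(), 11)
plt.hist(x, bin_pos)
plt.show()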

Here bin_pos is a list representing the positions of the bin boundaries (see the related question linked from the original post).

Recommended answer

I found this question intriguing. The solution depends on whether you want to plot a density function or a true histogram. The latter case turns out to be quite a bit more challenging. The original answer links to more information on the difference between a histogram and a density function.

This will do what you want for a density function:

def histedges_equalN(x, nbin):
    # Interpolate the sorted data at nbin + 1 evenly spaced ranks,
    # so each bin holds about len(x) / nbin points
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.random.randn(1000)
# density=True replaces the normed=True argument of older Matplotlib releases
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)

Note the use of density=True (written as normed=True in the original answer, which targeted an older Matplotlib), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution works by finding bins that contain the same number of points.
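The equal-area claim for the density case can be checked without plotting. The following is a minimal sketch, assuming np.histogram with density=True normalizes the same way plt.hist does; the name equal_count_edges and the fixed seed are illustrative and not part of the original answer.

```python
import numpy as np

def equal_count_edges(x, nbin):
    """Bin edges holding roughly len(x) / nbin points each (same idea as histedges_equalN)."""
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1), np.arange(npt), np.sort(x))

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
x = rng.standard_normal(1000)
edges = equal_count_edges(x, 10)

# With density=True each bar height is count / (N * width),
# so height * width is the fraction of points in that bin.
density, _ = np.histogram(x, bins=edges, density=True)
areas = density * np.diff(edges)
print(areas)  # each entry should be close to 1 / nbin
```

Because every bin holds about the same number of points, every bar covers about the same fraction of the total area, whatever its width.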

Here is a solution that gives approximately equal area boxes for a histogram:

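def histedges_equalA(x, nbin):
    # Exponent applied to the gaps between sorted points (0.5 = square root).
    # Note: the name pow shadows the Python built-in inside this function.
    pow = 0.5
    dx = np.diff(np.sort(x))
    # Cumulative weight along the sorted data, with a leading 0 for the first point
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    # Place the edges at equal steps of cumulative weight
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))

nbin = 10
# Raw counts (the default), not a density
n, bins, patches = plt.hist(x, histedges_equalA(x, nbin))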

These boxes, however, are not all of equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
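To see how uneven the boxes are for a given sample, one can compute each bar's area directly as count times width. This is a rough sketch; the names bar_areas and equal_area_edges are hypothetical, and the edge calculation is a compact restatement of the square-root approach above rather than the answer's exact code.

```python
import numpy as np

def equal_area_edges(x, nbin, power=0.5):
    """Edges at equal steps of the cumulative sum of gap**power over the sorted data."""
    xs = np.sort(x)
    weight = np.concatenate(([0.0], np.cumsum(np.diff(xs) ** power)))
    return np.interp(np.linspace(0, weight[-1], nbin + 1), weight, xs)

def bar_areas(x, edges):
    """Area of each histogram bar: number of points in the bin times the bin width."""
    counts, _ = np.histogram(x, bins=edges)
    return counts * np.diff(edges)

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
x = rng.standard_normal(1000)
edges = equal_area_edges(x, 10)
areas = bar_areas(x, edges)

# Ratio of each bar's area to the mean area; values near 1 mean nearly equal boxes
print(areas / areas.mean())
```

Printing the ratio rather than the raw areas makes it easy to spot which bins, typically those in the tails, deviate most.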

Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
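One way to explore the exponent is to sweep it over several random samples and compare how much the bar areas spread. The sketch below assumes "RMS error" means the RMS deviation of the bar areas from their mean, divided by that mean; the answer does not define its metric, so this is a guess, and the function names are illustrative.

```python
import numpy as np

def edges_for_power(x, nbin, power):
    """Edges at equal steps of the cumulative sum of gap**power over the sorted data."""
    xs = np.sort(x)
    weight = np.concatenate(([0.0], np.cumsum(np.diff(xs) ** power)))
    return np.interp(np.linspace(0, weight[-1], nbin + 1), weight, xs)

def relative_rms_spread(x, nbin, power):
    """RMS deviation of bar areas (count * width) from their mean, relative to the mean."""
    edges = edges_for_power(x, nbin, power)
    counts, _ = np.histogram(x, bins=edges)
    areas = counts * np.diff(edges)
    return np.sqrt(np.mean((areas - areas.mean()) ** 2)) / areas.mean()

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
samples = [rng.standard_normal(1000) for _ in range(20)]
for power in (0.45, 0.5, 0.56, 0.6):
    score = np.mean([relative_rms_spread(s, 10, power) for s in samples])
    print(f"power={power:.2f}  mean relative RMS spread={score:.3f}")
```

Averaging over several samples reduces the noise from any single draw; the candidate exponents listed here are arbitrary starting points.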

As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3 (the original answer marked these two points as red dots in a figure).

Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
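The jump can be reproduced with a few lines of arithmetic. This is a simplified sketch: the helper first_bar_area is hypothetical, and it treats the bar as a closed interval, so a point lying exactly on the right edge is counted.

```python
def first_bar_area(points, left, right):
    """Number of points in [left, right] times the bar width."""
    count = sum(left <= p <= right for p in points)
    return count * (right - left)

points = [-13.0, -3.0]  # the outlier and the next value from the example
left = -13.0

# Move the right edge toward the second point and watch the area
for right in (-3.5, -3.01, -3.0):
    print(right, first_bar_area(points, left, right))
```

While the bar holds only the outlier its area stays below 10, and it becomes 20 the moment the edge reaches -3, so no edge position yields the target of 15.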
