Histogram for discrete values with matplotlib


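Question

I sometimes have to histogram discrete values with matplotlib. In that case, the choice of binning can be crucial: if you histogram [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] using 10 bins, one of the bins will have twice as many counts as the others. In other words, the bin size should normally be a multiple of the discretization size.

While this simple case is relatively easy to handle myself, does anyone have a pointer to a library/function that would take care of this automatically, including for floating-point data where the discretization size could vary slightly due to FP rounding?

Thanks.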

Answer

Given the title of your question, I will assume that the discretization size is constant.

You can find this discretization size (or at least, strictly, n times that size, as you may not have two adjacent samples in your data) with:

np.diff(np.unique(data)).min()

This finds the unique values in your data (np.unique) and then the differences between them (np.diff). The unique step is needed so that you get no zero differences. You then take the minimum difference. There could be problems with this when the discretization constant is very small - I'll come back to that.
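As a small illustration of the step just described, here is a sketch on made-up data (the arrays below are hypothetical, not from the question). It also shows the "n times that size" caveat: when no two neighbouring grid values are present, the estimate comes out as a multiple of the true step.

```python
import numpy as np

# Hypothetical sample: values on a grid of step 0.5, with repeats and gaps
data = np.array([1.0, 1.5, 1.5, 2.0, 3.0, 3.0])

unique_values = np.unique(data)  # sorted, duplicates removed
gaps = np.diff(unique_values)    # spacing between neighbouring distinct values
d = gaps.min()                   # smallest spacing = estimated step

# Same grid (step 0.5), but no two adjacent grid points were sampled,
# so the estimate is a multiple of the true step
sparse = np.array([1.0, 2.0, 2.0, 3.5])
d_sparse = np.diff(np.unique(sparse)).min()

print(d, d_sparse)
```

Without np.unique, the repeated values would contribute zero differences and the minimum would be 0.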

Next - you want your values to be in the middle of the bin - your current issue is because both 9 and 10 are on the edges of the last bin that matplotlib automatically supplies, so you get two samples in one bin.
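The effect can be checked numerically without plotting. The following is a minimal sketch using np.histogram, which bins data the same way plt.hist does; the variable names are only for illustration.

```python
import numpy as np

data = np.arange(11)  # the values 0, 1, ..., 10

# Ten equal-width bins over [0, 10]: the edges land exactly on the data values,
# and the last bin is closed on the right, so it collects both 9 and 10
counts_default, edges_default = np.histogram(data, bins=10)

# Edges shifted by half a step: every value sits in the middle of its own bin
centered_edges = np.arange(-0.5, 11.0, 1.0)
counts_centered, _ = np.histogram(data, bins=centered_edges)

print(counts_default)
print(counts_centered)
```

With the default edges the final count is doubled; with the centered edges each of the 11 bins holds exactly one value.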

So - try this:

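import matplotlib.pyplot as plt
import numpy as np

data = range(11)
data = np.array(data)

# Smallest gap between distinct values = discretization size
d = np.diff(np.unique(data)).min()
# Shift the outer edges by half a step so each value sits mid-bin
left_of_first_bin = data.min() - float(d)/2
right_of_last_bin = data.max() + float(d)/2
# Bin edges every d, from half a step below the minimum to half a step above the maximum
plt.hist(data, np.arange(left_of_first_bin, right_of_last_bin + d, d))
plt.show()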

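This gives a histogram in which each of the 11 values falls into its own bin, so all bars have the same height (the original plot is not reproduced here).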


Small non-integer discretization

We can make a somewhat more demanding test data set, e.g.

import random

# 1000 random integers in 1..100 (inclusive)
data = []
for _ in range(1000):
    data.append(random.randint(1, 100))
data = np.array(data)
nasty_d = 1.0 / 597  # Arbitrary smallish discretization
data = data * nasty_d

If you then run that array through the code above and look at the d it computes, you will see:

>>> print(nasty_d)
0.0016750418760469012
>>> print(d)
0.00167504187605

So the detected value of d is not exactly the "real" value of nasty_d that the data was created with. However, with the trick of shifting the bins by half of d to put the values in the middle, it shouldn't matter unless your discretization is so small that you are down at the limits of float precision, or you have thousands of bins, so that the difference between the detected d and the "real" discretization builds up to the point where one of the bins "misses" its data point. It's something to be aware of, but it probably won't hit you.
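If that build-up ever becomes a concern, one possible way to limit it (a swapped-in variation, not part of the original answer) is to fix both outer edges and let np.linspace divide the range, instead of repeatedly stepping by the detected d. The sketch below uses a deterministic stand-in for the random test data, with each of the values 1..100 appearing ten times, so the result can be checked.

```python
import numpy as np

nasty_d = 1.0 / 597
# Deterministic stand-in for the random data: each of 1..100 appears ten times
data = np.repeat(np.arange(1, 101), 10) * nasty_d

d = np.diff(np.unique(data)).min()

# Number of grid steps spanned by the data, rounded to the nearest integer
n_steps = int(round((data.max() - data.min()) / d))

# linspace pins both end points, so a small error in d is not
# multiplied by the bin index as it is when stepping by d
edges = np.linspace(data.min() - d / 2, data.max() + d / 2, n_steps + 2)

counts, _ = np.histogram(data, bins=edges)
print(len(counts), counts.min(), counts.max())
```

This assumes the grid is uniform and that d was detected as the true step rather than a multiple of it; the same `edges` array can be passed to plt.hist as its bins argument.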

An example plot for the above was included in the original answer (figure not reproduced here).


Non-uniform discretization / most appropriate bins...

For more complex cases, you might like to look at this blog post I found. It looks at ways of automatically "learning" the best bin widths from (continuous / quasi-continuous) data, referencing several standard techniques such as Sturges' rule and the Freedman-Diaconis rule before developing its own Bayesian dynamic programming method.

If this is your use case - the question is far broader and may not be suited to a definitive answer on Stack Overflow, although hopefully the links will help.
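For the two standard rules named above, NumPy has built-in estimators, selectable by name through np.histogram_bin_edges (and through the bins argument of np.histogram and plt.hist). The sketch below uses made-up normally distributed samples. It does not implement the Bayesian method from the blog post.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)  # hypothetical continuous data

# Bin edges chosen by Sturges' rule and by the Freedman-Diaconis rule
sturges_edges = np.histogram_bin_edges(samples, bins="sturges")
fd_edges = np.histogram_bin_edges(samples, bins="fd")

print(len(sturges_edges) - 1, "bins (Sturges)")
print(len(fd_edges) - 1, "bins (Freedman-Diaconis)")
```

These estimators are designed for continuous data; for discrete data they can place edges on the values themselves, which is the problem the half-step shift above avoids.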
