Plotting data points on where they fall in a distribution

Question

Let's say I have a large data set that I can manipulate in some sort of analysis, for example by looking at its values in a probability distribution.

Now that I have this large data set, I want to compare known, actual data to it. Primarily, I want to know how many of the values in my data set share a value or property with the known data. For example:

This is a cumulative distribution. The continuous lines come from data generated by simulations, and the decreasing intensities are just predicted percentages. The stars are observational (known) data, plotted against the generated data.

Another example I have made shows how the points could be projected visually onto a histogram:

I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.

If I were to try to retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):

def SameValue(SimData, DefData, uncert):
     numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
     return sum(numb)

But I am having trouble accounting for the points that fall in the value ranges and then setting it all up so that I can plot it. Any ideas on how to gather this data and project it onto a cumulative distribution?
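For reference, the counting step described above can be sketched in vectorized form with NumPy. This is only a minimal sketch: the function name `count_within` and the toy array are hypothetical, and it assumes a single known value with a symmetric uncertainty.

```python
import numpy as np

def count_within(sim_data, known_value, uncert):
    """Count simulated values strictly inside known_value +/- uncert.

    Hypothetical helper for illustration; not part of the original post.
    """
    sim = np.asarray(sim_data, dtype=float)
    # Boolean mask of simulated values inside the tolerance window
    mask = np.abs(sim - known_value) < uncert
    return int(mask.sum())

# Toy data (made up for the example)
sim = np.array([0.1, 0.5, 0.74, 0.8, 1.3, 2.0])
n = count_within(sim, 0.77, 0.05)
frac = n / sim.size  # fraction of the simulated sample near the known value
print(n, frac)
```

Dividing the count by the sample size gives the fraction of simulated values near the known one, which is a quantity that can be plotted directly.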

Recommended answer

The question is pretty chaotic, with lots of irrelevant information, while staying vague on the essential points. I will try to interpret it as best I can.

I think what you are after is the following: given a finite sample from an unknown distribution, what is the probability of obtaining a new sample at a fixed value?

I'm not sure if there is a general answer to that, but in any case it would be a question for statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.

For the practical case, however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.

So assume we have a sample x from a distribution, which we divide into bins. We can compute the histogram h using numpy.histogram. The probability of finding a value in each bin is then given by h/h.sum().
Given a value v=0.77 whose probability according to the distribution we want to know, we can find the bin it belongs to by looking for the index ind in the bin array at which this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.

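import numpy as np; np.random.seed(0)

x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())

# index at which 0.77 would be inserted; the bin containing it is one to the left
ind = np.searchsorted(bins, 0.77, side="right")
print(prob[ind-1]) # probability of the bin [0.7, 0.8)

So the printed value is the probability of sampling a value in the bin around 0.77. For a Rayleigh distribution with unit scale the exact probability of that bin is about 5.7%, and the estimate from 1000 samples should be close to it.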
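The same bin lookup can be done for several known values at once. The following is a minimal sketch under a few assumptions: the array `known` is made up for illustration, `numpy.random.default_rng` is used instead of the legacy seeding above, and values outside the binned range are assigned probability 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(size=1000)
bins = np.linspace(0, 4, 41)
h, _ = np.histogram(x, bins=bins)
prob = h / h.sum()

# Hypothetical known values; the last one lies outside the binned range
known = np.array([0.77, 1.13, 2.15, 5.0])

# Insertion index minus one is the index of the bin containing each value
idx = np.searchsorted(bins, known, side="right") - 1

# Values below the first edge, or at or beyond the last edge, count as outside
inside = (idx >= 0) & (idx < len(prob))
p_known = np.where(inside, prob[np.clip(idx, 0, len(prob) - 1)], 0.0)
print(idx, p_known)
```

Note that numpy.histogram counts a value exactly equal to the last edge in the last bin, whereas this lookup treats it as outside; for continuous data that case is rarely relevant.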

A different option would be to interpolate the histogram between the bin centers to find the probability.
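As a rough illustration of the interpolation idea, one could evaluate the per-bin probabilities at the bin centers and interpolate linearly with numpy.interp. This is a sketch under assumptions, not the answer's own code: the names `centers` and `p_interp` are introduced here, and linear interpolation is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(size=1000)
bins = np.linspace(0, 4, 41)
h, _ = np.histogram(x, bins=bins)
prob = h / h.sum()

# Midpoint of each bin, where the per-bin probability is anchored
centers = 0.5 * (bins[:-1] + bins[1:])

v = 0.77
# Linear interpolation between the two neighbouring bin centers
p_interp = np.interp(v, centers, prob)
print(p_interp)
```

The result is still a probability per bin of width 0.1, not a probability density; dividing by the bin width would give a density estimate.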

In the code below we plot a distribution similar to the one in the picture from the question and use both methods: the first for the frequency histogram, the second for the cumulative distribution.

import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt

x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())

points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$','$\u2665$','$\u263B$']  # Python 3 strings are unicode; the ur'' prefix is Python 2 only
colors = ["k", "crimson" , "gold"]
labels = list("ABC")

kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)

cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize=2, mfc="k", mec="k" )
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)

for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    #plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)
ax.legend()
plt.tight_layout()  
plt.show()
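If binning is not needed for the cumulative view, the position of each known point can also be read from the empirical cumulative distribution of the sample. This swaps in a different technique from the interpolated histogram above; it is a minimal sketch, and the array `known` and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(size=1000)

# Hypothetical known values to locate in the generated sample
known = np.array([0.77, 1.13, 2.15])

# Empirical CDF: fraction of generated values less than or equal to each known value
xs = np.sort(x)
ecdf_at_known = np.searchsorted(xs, known, side="right") / xs.size
print(ecdf_at_known)
```

Plotting `xs` against `np.arange(1, xs.size + 1) / xs.size` gives the cumulative curve, and the pairs `(known, ecdf_at_known)` then lie on that curve by construction.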
