How to compute the probability of a value given a list of samples from a distribution in Python?

Problem Description

Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:

data = [300,244,543,1011,300,125,300 ... ]

And I would like to know the probability of a value occurring given this data. I graphed histograms of the data using matplotlib and obtained these:

In the first graph, the numbers represent the number of characters in a sequence. In the second graph, it's a measured amount of time in milliseconds. The minimum is greater than zero, but there isn't necessarily a maximum. The graphs were created using millions of examples, but I'm not sure I can make any other assumptions about the distribution. I want to know the probability of a new value given that I have a few million examples of values. In the first graph, I have a few million sequences of different lengths; I'd like to know the probability of a length of 200, for example.

I know that for a continuous distribution the probability of any exact point is supposed to be zero, but given a stream of new values, I need to be able to say how likely each one is. I've looked through some of the numpy/scipy probability density functions, but I'm not sure which to choose or how to query for new values once I run something like scipy.stats.norm.pdf(data). It seems like different probability density functions will fit the data differently, and given the shape of the histograms I'm not sure how to decide which to use.

Recommended Answer

Since you don't seem to have a specific distribution in mind, but you might have a lot of data samples, I suggest using a non-parametric density estimation method. One of the data types you describe (time in ms) is clearly continuous, and one method for non-parametric estimation of a probability density function (PDF) for continuous random variables is the histogram that you already mentioned. However, as you will see below, Kernel Density Estimation (KDE) can be better. The second type of data you describe (number of characters in a sequence) is of the discrete kind. Here, kernel density estimation can also be useful and can be seen as a smoothing technique for the situations where you don't have a sufficient amount of samples for all values of the discrete variable.
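
As a baseline for the discrete case, the probability of a specific value can simply be estimated as its relative frequency in the sample; KDE then acts as a smoothing of that estimate when some values have few or no observations. The sketch below assumes the sequence lengths are stored in a plain list like the data variable from the question (the values shown are placeholders):

from collections import Counter

data = [300, 244, 543, 1011, 300, 125, 300]  # placeholder for the real samples
counts = Counter(data)

# Empirical probability of observing a sequence length of exactly 200
p_200 = counts[200] / len(data)
print(p_200)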

The example below shows how to first generate data samples from a mixture of 2 Gaussian distributions and then apply kernel density estimation to find the probability density function:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
                       10 + np.random.randn(30, 1)))

# Plot the true distribution
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = norm.pdf(x, 5, 1) * 0.25 + norm.pdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals)

# Plot the data using a normalized histogram
plt.hist(data, 50, density=True)

# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)

# Plot the estimated density
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals)

# Show the plots
plt.show()

This will produce the following plot, where the true distribution is shown in blue, the histogram is shown in green, and the PDF estimated using KDE is shown in red:

As you can see, in this situation the PDF approximated by the histogram is not very useful, while KDE provides a much better estimate. However, with a larger number of data samples and a proper choice of bin size, the histogram might produce a good estimate as well.
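
For completeness, here is a minimal sketch of how a histogram-based estimate could be queried for a new value, assuming data is the sample array from the example above; the bin count of 50 and the query value 5.5 are arbitrary choices:

# Build a normalized histogram and look up the density of the bin a new value falls into
hist, bin_edges = np.histogram(data, bins=50, density=True)

new_value = 5.5
bin_index = np.digitize(new_value, bin_edges) - 1
if 0 <= bin_index < len(hist):
    density_at_value = hist[bin_index]
else:
    density_at_value = 0.0  # value lies outside the observed range
print(density_at_value)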

The parameters you can tune in the case of KDE are the kernel and the bandwidth. You can think of the kernel as the building block of the estimated PDF, and several kernel functions are available in Scikit Learn: gaussian, tophat, epanechnikov, exponential, linear, cosine. Changing the bandwidth lets you adjust the bias-variance trade-off: a larger bandwidth increases bias, which is fine when you have few data samples, while a smaller bandwidth increases variance (fewer samples contribute to each point of the estimate) but gives a better estimate when more samples are available.
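
Rather than fixing the bandwidth by hand, it can also be selected by cross-validation. The sketch below uses scikit-learn's GridSearchCV with the estimator's built-in score (total log-likelihood of held-out data); it assumes data is the sample array from the example above, and the candidate bandwidth grid is an arbitrary choice:

from sklearn.model_selection import GridSearchCV

# Search over candidate bandwidths; KernelDensity.score (log-likelihood of
# held-out data) is used as the cross-validation criterion
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 2.0, 20)},
                    cv=5)
grid.fit(data)
print(grid.best_params_['bandwidth'])
kd_cv = grid.best_estimator_  # a KernelDensity refit on all data with the selected bandwidth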

For a PDF, a probability is obtained by integrating over a range of values. As you noticed, this leads to a probability of 0 for any specific value.

Scikit Learn does not seem to have a built-in function for computing such a probability, but it is easy to approximate the integral of the PDF over a range: evaluate the PDF at multiple points within the range and sum the obtained values multiplied by the step size between evaluation points. In the example below, the PDF is evaluated at N points spaced step apart.

# Get probability for range of values
start = 5  # Start of the range
end = 6    # End of the range
N = 100    # Number of evaluation points 
step = (end - start) / (N - 1)  # Step size
x = np.linspace(start, end, N)[:, np.newaxis]  # Generate values in the range
kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
print(probability)

Note that kd.score_samples returns the log-density (log-likelihood) of the given samples, so np.exp is needed to obtain the density itself.
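
This also answers the original question of how to query a new value directly: the fitted estimator can be evaluated at any point, keeping in mind that the result is a density, not a probability. A minimal sketch, where 5.5 is an arbitrary example value:

new_value = np.array([[5.5]])  # score_samples expects a 2D array of shape (n_samples, n_features)
density_at_value = np.exp(kd.score_samples(new_value))[0]
print(density_at_value)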

The probability over the same range can also be computed with SciPy's built-in numerical integration, which gives a slightly more accurate result:

from scipy.integrate import quad
# score_samples expects a 2D array, so the scalar that quad passes in is wrapped accordingly
probability = quad(lambda v: np.exp(kd.score_samples(np.array([[v]])))[0], start, end)[0]

For instance, for one run, the first method calculated the probability as 0.0859024655305, while the second method produced 0.0850974209996139.
