How to compute the probability of a value given a list of samples from a distribution in Python?

Problem Description

Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:

data = [300,244,543,1011,300,125,300 ... ]

And I would like to know the probability of a value occurring given this data. I graphed histograms of the data using matplotlib and obtained these:

In the first graph, the values represent the number of characters in a sequence. In the second graph, it's a measured amount of time in milliseconds. The minimum is greater than zero, but there isn't necessarily a maximum. The graphs were created using millions of examples, but I'm not sure I can make any other assumptions about the distribution. I want to know the probability of a new value given that I have a few million examples of values. In the first graph, I have a few million sequences of different lengths. I would like to know the probability of a length of 200, for example.

I know that for a continuous distribution the probability of any exact point is supposed to be zero, but given a stream of new values, I need to be able to say how likely each value is. I've looked through some of the numpy/scipy probability density functions, but I'm not sure which to choose or how to query for new values once I run something like scipy.stats.norm.pdf(data). It seems like different probability density functions will fit the data differently. Given the shape of the histograms, I'm not sure how to decide which to use.

Solution

Since you don't seem to have a specific distribution in mind, but you might have a lot of data samples, I suggest using a non-parametric density estimation method. One of the data types you describe (time in ms) is clearly continuous, and one method for non-parametric estimation of a probability density function (PDF) for continuous random variables is the histogram that you already mentioned. However, as you will see below, Kernel Density Estimation (KDE) can be better. The second type of data you describe (number of characters in a sequence) is of the discrete kind. Here, kernel density estimation can also be useful and can be seen as a smoothing technique for the situations where you don't have a sufficient amount of samples for all values of the discrete variable.

Estimating Density

The example below shows how to first generate data samples from a mixture of 2 Gaussian distributions and then apply kernel density estimation to find the probability density function:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
                       10 + np.random.randn(30, 1)))

# Plot the true distribution
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = norm.pdf(x, 5, 1) * 0.25 + norm.pdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals)

# Plot the data using a normalized histogram
plt.hist(data, 50, density=True)

# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)

# Plot the estimated density
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals)

# Show the plots
plt.show()

This will produce the following plot, where the true distribution is shown in blue, the histogram is shown in green, and the PDF estimated using KDE is shown in red:

As you can see, in this situation the PDF approximated by the histogram is not very useful, while KDE provides a much better estimate. However, with a larger number of data samples and a proper choice of bin size, the histogram might produce a good estimate as well.

The parameters you can tune in the case of KDE are the kernel and the bandwidth. You can think of the kernel as the building block of the estimated PDF, and several kernel functions are available in Scikit Learn: gaussian, tophat, epanechnikov, exponential, linear, cosine. Changing the bandwidth allows you to adjust the bias-variance trade-off: a larger bandwidth results in increased bias, which is fine if you have fewer data samples, while a smaller bandwidth increases variance (fewer samples contribute to each point of the estimate) but gives a better estimate when more samples are available.
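If you don't want to pick the bandwidth by hand, one common approach (not part of the original answer, just a sketch) is to choose it by cross-validation, for example with scikit-learn's GridSearchCV, which ranks candidate bandwidths by the held-out log-likelihood that KernelDensity.score reports:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Search a grid of candidate bandwidths; GridSearchCV uses
# KernelDensity.score (total log-likelihood of held-out data) to rank them.
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(data)  # `data` is the same (n_samples, 1) array fitted above

print("best bandwidth:", grid.best_params_['bandwidth'])
kd = grid.best_estimator_  # refitted on all of `data` with the best bandwidth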

Calculating Probability

For a PDF, probability is obtained by calculating the integral over a range of values. As you noticed, that will lead to the probability 0 for a specific value.

Scikit Learn does not seem to have a built-in function for calculating the probability. However, it is easy to estimate the integral of the PDF over a range: evaluate the PDF at multiple points within the range, multiply each value by the step size between evaluation points, and sum the results. In the example below, the PDF is evaluated at N points spaced step apart.

# Get probability for range of values
start = 5  # Start of the range
end = 6    # End of the range
N = 100    # Number of evaluation points 
step = (end - start) / (N - 1)  # Step size
x = np.linspace(start, end, N)[:, np.newaxis]  # Generate values in the range
kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
print(probability)

Please note that kd.score_samples returns the log of the estimated density at the given points, so np.exp is needed to obtain the density values.
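As a quick sanity check (my addition, not in the original answer), the exponentiated scores should integrate to roughly 1 over a range wide enough to cover all of the data:

# Riemann-sum approximation of the total integral of the estimated PDF;
# the result should be close to 1.0 for a proper density.
x_wide = np.linspace(-5, 20, 2000)[:, np.newaxis]
dx = x_wide[1, 0] - x_wide[0, 0]
print(np.sum(np.exp(kd.score_samples(x_wide)) * dx))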

The same computation can be performed with SciPy's built-in integration routines, which will give a slightly more accurate result:

from scipy.integrate import quad
# score_samples expects a 2D array, so wrap the scalar x that quad passes in
probability = quad(lambda x: np.exp(kd.score_samples(np.array([[x]])))[0], start, end)[0]

For instance, for one run, the first method calculated the probability as 0.0859024655305, while the second method produced 0.0850974209996139.
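To connect this back to the original question, here is a hypothetical end-to-end sketch (the data, the bandwidth of 30, and the integration range are made up for illustration) that estimates how likely a sequence length of about 200 is by integrating the KDE over a unit-wide interval around 200:

import numpy as np
from sklearn.neighbors import KernelDensity

# Made-up stand-in for the asker's list of integer sequence lengths
data = np.array([300, 244, 543, 1011, 300, 125, 300])[:, np.newaxis]

# Bandwidth chosen arbitrarily for illustration; in practice tune it,
# e.g. with the cross-validation sketch shown earlier
kd = KernelDensity(kernel='gaussian', bandwidth=30.0).fit(data)

# P(length is about 200) ~= integral of the estimated PDF over [199.5, 200.5]
start, end, N = 199.5, 200.5, 100
x = np.linspace(start, end, N)[:, np.newaxis]
step = (end - start) / (N - 1)
probability = np.sum(np.exp(kd.score_samples(x)) * step)
print(probability)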
