使用 Scipy (Python) 将经验分布拟合到理论分布? [英] Fitting empirical distribution to theoretical ones with Scipy (Python)?

查看:27
本文介绍了使用 Scipy (Python) 将经验分布拟合到理论分布?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

介绍:我有一个包含 30,000 多个整数值的列表,范围从 0 到 47(含),例如[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] 从一些连续分布中采样.列表中的值不一定按顺序排列,但顺序对这个问题没有影响.

问题:根据我的分布,我想计算任何给定值的 p 值(看到更大值的概率).例如,如您所见,0 的 p 值将接近 1,而较大数字的 p 值将趋于 0.

我不知道我是否正确,但为了确定概率,我认为我需要将我的数据拟合到最适合描述我的数据的理论分布.我假设需要某种拟合优度来确定最佳模型.

有没有办法在 Python 中实现这样的分析(ScipyNumpy)?你能举出一些例子吗?

解决方案

Distribution Fitting with Sum of Square Error (SSE)

这是对

最佳拟合分布

示例代码

%matplotlib 内联进口警告将 numpy 导入为 np将熊猫导入为 pd将 scipy.stats 导入为 st将 statsmodels.api 导入为 sm从 scipy.stats._continuous_distns 导入 _distn_names导入 matplotlib导入 matplotlib.pyplot 作为 pltmatplotlib.rcParams['figure.figsize'] = (16.0, 12.0)matplotlib.style.use('ggplot')# 从数据创建模型def best_fit_distribution(data, bins=200, ax=None):"通过找到数据的最佳拟合分布来对数据建模""# 获取原始数据的直方图y, x = np.histogram(data, bins=bins, density=True)x = (x + np.roll(x, -1))[:-1]/2.0# 最佳持有人best_distributions = []# 从数据估计分布参数对于 ii,在 enumerate 中的分布([d for d in _distn_names if not d in ['levy_stable', 'studentized_range']]):打印({:> 3}/{:<3}:{}".格式(ii+1,len(_distn_names),分布))分布 = getattr(st, 分布)# 尝试拟合分布尝试:# 忽略无法拟合的数据的警告使用 warnings.catch_warnings():warnings.filterwarnings('忽略')# 将 dist 拟合到数据参数 = 分布.拟合(数据)# 分离部分参数参数 = 参数[:-2]loc = 参数[-2]比例=参数[-1]# 计算拟合 PDF 和分布拟合误差pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)sse = np.sum(np.power(y - pdf, 2.0))# 如果轴传入添加到绘图尝试:如果斧头:pd.Series(pdf, x).plot(ax=ax)结尾除了例外:经过# 确定这个分布是否更好best_distributions.append((distribution, params, sse))除了例外:经过返回排序(best_distributions,key=lambda x:x[2])def make_pdf(dist, params, size=10000):"生成分布的概率分布函数""# 分离部分参数参数 = 参数[:-2]loc = 参数[-2]比例=参数[-1]# 获取合理的分布起点和终点start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)# 构建PDF转成pandas系列x = np.linspace(开始,结束,大小)y = dist.pdf(x, loc=loc, scale=scale, *arg)pdf = pd.Series(y, x)返回pdf# 从 statsmodels 数据集加载数据数据 = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())# 绘制比较plt.figure(figsize=(12,8))ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])# 保存绘图限制dataYLim = ax.get_ylim()# 找到最佳拟合分布best_distibutions = best_fit_distribution(data, 200, ax)best_dist = best_distibutions[0]# 更新图ax.set_ylim(dataYLim)ax.set_title(u'El Niño sea temp.
 All Fitted Distributions')ax.set_xlabel(u'Temp (°C)')ax.set_ylabel('频率')# 用最好的参数制作PDFpdf = make_pdf(best_dist[0], best_dist[1])# 展示plt.figure(figsize=(12,8))ax = pdf.plot(lw=2, label='PDF', legend=True)data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)param_names = (best_dist[0].shapes + ', loc, scale').split(', ') if best_dist[0].shapes else ['loc', 'scale']param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_dist[1])])dist_str = '{}({})'.format(best_dist[0].name, param_str)ax.set_title(u'El Niño 海温.最佳拟合分布 
' + dist_str)ax.set_xlabel(u'Temp. (°C)')ax.set_ylabel('频率')

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g.[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn't matter for this problem.

PROBLEM: Based on my distribution I would like to calculate p-value (the probability of seeing greater values) for any given value. For example, as you can see p-value for 0 would be approaching 1 and p-value for higher numbers would be tending to 0.

I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

解决方案

Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo's answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Distributions

Best Fit Distribution

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Best holders
    best_distributions = []

    # Estimate distribution parameters from data
    for ii, distribution in enumerate([d for d in _distn_names if not d in ['levy_stable', 'studentized_range']]):

        print("{:>3} / {:<3}: {}".format( ii+1, len(_distn_names), distribution ))

        distribution = getattr(st, distribution)

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                
                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                
                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                    end
                except Exception:
                    pass

                # identify if this distribution is better
                best_distributions.append((distribution, params, sse))
        
        except Exception:
            pass

    
    return sorted(best_distributions, key=lambda x:x[2])

def make_pdf(dist, params, size=10000):
    """Generate distributions's Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=list(matplotlib.rcParams['axes.prop_cycle'])[1]['color'])

# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_distibutions = best_fit_distribution(data, 200, ax)
best_dist = best_distibutions[0]

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.
 All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist[0].shapes + ', loc, scale').split(', ') if best_dist[0].shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = '{}({})'.format(best_dist[0].name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution 
' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')

这篇关于使用 Scipy (Python) 将经验分布拟合到理论分布?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆