Fitting data points to a cumulative distribution


Question

I am trying to fit a gamma distribution to my data points, and I can do that using the code below.

import scipy.stats as ss
import numpy as np

dataPoints = np.arange(0, 1000, 0.2)
# Maximum-likelihood fit of a gamma distribution with the location fixed at 0.
fit_alpha, fit_loc, fit_beta = ss.gamma.fit(dataPoints, floc=0)

I want to reconstruct a larger distribution using many such small gamma distributions (the larger distribution is irrelevant to the question; it only justifies why I am trying to fit a CDF as opposed to a PDF).

To achieve that, I want to fit a cumulative distribution, as opposed to a PDF, to my smaller distribution data. More precisely, I want to fit the data to only a part of the cumulative distribution.

For example, I want to fit the data only up to the point where the cumulative probability function (with a certain scale and shape) reaches 0.6.

Any thoughts on using fit() for this purpose?
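
For concreteness, here is a minimal sketch of the kind of partial fit I have in mind. It does not use fit() (which maximizes the likelihood over the whole sample) but instead least-squares fits the gamma CDF to the empirical CDF via scipy.optimize.curve_fit; the 0.6 cutoff and starting values are only illustrative.

import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit

# Empirical CDF of the sample.
x = np.sort(dataPoints)
ecdf = np.arange(1, len(x) + 1) / len(x)

# Keep only the part of the curve below the 0.6 cutoff.
mask = ecdf <= 0.6

# Least-squares fit of the gamma CDF (loc fixed at 0) to that part.
gamma_cdf = lambda xv, a, scale: ss.gamma.cdf(xv, a, loc=0, scale=scale)
(shape, scale), _ = curve_fit(gamma_cdf, x[mask], ecdf[mask],
                              p0=(1.0, x[mask].mean()))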

Answer

I understand that you are trying to piecewise-reconstruct your CDF with several small gamma distributions, each with a different scale and shape parameter, capturing the 'local' regions of your distribution.

That probably makes sense if your empirical distribution is multi-modal or otherwise difficult to summarize with one 'global' parametric distribution.
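
If you do go down the piecewise route, one minimal sketch (assuming you simply split the sorted sample into contiguous chunks; the number of chunks and the chunking rule are arbitrary choices here):

import numpy as np
import scipy.stats as ss

# Split the sorted sample into contiguous 'local' regions and fit one
# gamma distribution (location fixed at 0) per region.
data = np.sort(dataPoints[dataPoints > 0])  # gamma support is x > 0
chunks = np.array_split(data, 4)
local_fits = [ss.gamma.fit(chunk, floc=0) for chunk in chunks]
# local_fits holds one (shape, loc, scale) tuple per chunk.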

I don't know whether you have specific reasons for fitting several gamma distributions in particular, but if your goal is to fit a distribution that is relatively smooth and captures your empirical CDF well, you might take a look at kernel density estimation (KDE). It is essentially a non-parametric way to fit a distribution to your data.

http://scikit-learn.org/stable/modules/density.html
http://en.wikipedia.org/wiki/Kernel_density_estimation

For example, you can try a Gaussian kernel and change the bandwidth parameter to control how smooth the fit is. A bandwidth that is too small leads to an unsmooth ("overfitted") result [high variance, low bias]; a bandwidth that is too large gives a very smooth result, but with high bias.

from sklearn.neighbors import KernelDensity

# KernelDensity expects a 2-D array of shape (n_samples, n_features).
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints[:, None])
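
Once fitted, a quick sketch of how you might use the estimate (the grid and sample count are arbitrary): score_samples returns the log-density, and sample draws new points from it.

import numpy as np

grid = np.linspace(dataPoints.min(), dataPoints.max(), 500)[:, None]
log_dens = kde.score_samples(grid)  # log p(x) evaluated on the grid
new_points = kde.sample(100)        # 100 draws from the fitted KDE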

A good way to select a bandwidth parameter that balances the bias-variance tradeoff is cross-validation. The high-level idea is to partition your data, run the analysis on the training set, and 'validate' on the test set; this prevents overfitting the data.

Fortunately, sklearn also ships a nice example of choosing the best bandwidth for a Gaussian kernel using cross-validation, from which you can borrow some code:

http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
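
A sketch along the lines of that example, using GridSearchCV to pick the bandwidth by 5-fold cross-validation (module paths are for recent scikit-learn; the grid of candidate bandwidths is arbitrary):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(dataPoints[:, None])  # again, a 2-D array of shape (n_samples, 1)
best_bandwidth = grid.best_params_['bandwidth']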

Hope this helps!
