How to find probability distribution and parameters for real data? (Python 3)


Question

I have a dataset from sklearn and I plotted the distribution of the load_diabetes.target data (i.e. the regression values that load_diabetes.data is used to predict).

I used this dataset because it has the fewest variables/attributes among the regression datasets in sklearn.datasets.

Using Python 3, how can I get the distribution type and parameters of the distribution that this data most closely resembles?

All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few distributions and then get the best fit for the target data/vector? Or, to actually suggest a fit based on the data that's given? That would be really useful for people who have theoretical statistical knowledge but little experience with applying it to "real data".

Bonus: Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set()

# Get data
data = load_diabetes()
X, y_ = data.data, data.target

# Organize data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

# Plot data (histplot replaces distplot, which is deprecated since seaborn 0.11)
fig, ax = plt.subplots()
sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)
plt.show()

Recommended answer

To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (inferring the distribution of a sample is a statistical problem in itself).

In my opinion, the best you can do is:

  • Try to fit each attribute to a reasonably large list of possible distributions (e.g. see Fitting empirical distribution to theoretical ones with Scipy (Python)? for an example with Scipy).

  • Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each of the fitted distributions (Scipy has an implementation of this as well), and picking the one that minimises D, the test statistic (the largest absolute difference between the empirical CDF of the sample and the CDF of the fit).
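The two steps above can be sketched roughly as follows. This is a minimal illustration rather than a definitive procedure: the shortlist in `CANDIDATES`, the helper name `rank_fits`, and the synthetic gamma sample are all assumptions made for the example. In practice you would pass in the question's `y_` (i.e. `load_diabetes().target`) instead of the synthetic data. Since the parameters are estimated from the same sample that is being tested, the p-values from the Kolmogorov-Smirnov test tend to be optimistic, so the sketch uses D only to rank the candidates.

```python
import numpy as np
from scipy import stats

# Hypothetical shortlist of scipy.stats distribution names; extend as needed.
CANDIDATES = ("gamma", "lognorm", "weibull_min", "expon", "norm")


def rank_fits(sample, candidates=CANDIDATES):
    """Fit each candidate by maximum likelihood and rank by the KS statistic D."""
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        # fit() returns the shape parameter(s), if any, followed by loc and scale.
        params = dist.fit(sample)
        # Compare the sample against the fitted CDF.
        d_stat, p_value = stats.kstest(sample, dist.cdf, args=params)
        results.append((name, params, d_stat, p_value))
    # Smallest D first: the closest fit by this criterion.
    return sorted(results, key=lambda r: r[2])


# Synthetic right-skewed stand-in for the real data (an assumption for this example).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=40.0, size=442)

ranked = rank_fits(sample)
for name, params, d_stat, p_value in ranked:
    print(f"{name:12s} D={d_stat:.4f} params={np.round(params, 3)}")
```

The first entry of `ranked` is the candidate with the smallest D. It is still worth overlaying its PDF on the histogram, because a small D among poor candidates does not by itself mean the fit is good.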

Bonus: It would make sense, since you are building a model of each variable as you pick a fit for it, although the quality of your predictions would depend on the quality of your data and on the distributions you use for fitting. You are building a model, after all.
