How to find probability distribution and parameters for real data? (Python 3)

Problem description

I have a dataset from sklearn, and I plotted the distribution of the load_diabetes.target data (i.e. the regression values that load_diabetes.data is used to predict).

I used this dataset because it has the fewest variables/attributes of the regression datasets in sklearn.datasets.

Using Python 3, how can I get the distribution type and the parameters of the distribution that this data most closely resembles?

All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few candidate distributions and then get the best fit for the target data/vector? Or, to actually suggest a fit based on the data that's given? That would be really useful for people who have theoretical statistical knowledge but little experience applying it to "real data".

Bonus: Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

#Get Data
data = load_diabetes()
X, y_ = data.data, data.target

#Organize Data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

#Plot Data
fig, ax = plt.subplots()
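# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)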
plt.show()

Recommended answer

To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (inferring the distribution of a sample is a statistical problem in itself).

In my opinion, the best you can do is:

(for each attribute)

  • Try to fit each attribute to a reasonably large list of candidate distributions (e.g. see "Fitting empirical distribution to theoretical ones with Scipy (Python)?" for an example with Scipy).

  • Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each of the fitted distributions (Scipy implements this test as well), and picking the one that minimises the test statistic D, the largest absolute difference between the empirical CDF of the sample and the CDF of the fit.
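The two steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive procedure. The candidate list `CANDIDATES` and the helper `rank_fits` are hypothetical names chosen for this example, and the sample is synthetic right-skewed data standing in for the target vector `y_` from the question. It relies on `scipy.stats.<dist>.fit` for maximum-likelihood fitting and `scipy.stats.kstest` for the Kolmogorov-Smirnov statistic.

```python
import numpy as np
from scipy import stats

# Hypothetical candidate list; any continuous scipy.stats distribution could be added.
CANDIDATES = ["norm", "gamma", "lognorm", "weibull_min", "expon"]


def rank_fits(sample, candidates=CANDIDATES):
    """Fit each candidate by maximum likelihood and rank the fits by the KS statistic D."""
    sample = np.asarray(sample, dtype=float)
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        # fit() returns the shape parameters (if any), followed by loc and scale.
        params = dist.fit(sample)
        # Compare the sample against the fitted distribution's CDF.
        res = stats.kstest(sample, name, args=params)
        results.append((name, float(res.statistic), float(res.pvalue), params))
    # Smallest D first: the closest fit according to the KS statistic.
    return sorted(results, key=lambda r: r[1])


# Synthetic positive, right-skewed data (an assumption made for this sketch).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=442)

ranking = rank_fits(sample)
for name, d_stat, p_value, params in ranking:
    print(f"{name:12s} D={d_stat:.4f}  p={p_value:.3f}")
```

To apply this to the data in the question, `rank_fits(y_)` could be called instead of using the synthetic sample. Because the parameters are estimated from the same sample that is being tested, the reported p-values tend to be optimistic, so D is better read as a ranking criterion than as a formal test result.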

Bonus: It would make sense, since you are building a model on each variable as you pick a fit for it, although the quality of your predictions would depend on the quality of your data and on the distributions you use for fitting. You are building a model, after all.
