How to estimate a density function and calculate its peaks?

Question

I have started to use python for analysis. I would like to do the following:

  1. Get the distribution of a data set
  2. Get the peaks in this distribution

I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? My data change over time, so if the data follow one distribution now (e.g. Gaussian), they could follow another one later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in another question to fit the data against every candidate distribution in order to find the data's distribution. So what is the difference between using gaussian_kde and the answer provided in that question? I used the code below, and I would also like to know whether gaussian_kde is a good way to estimate the pdf if the data change over time. I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb. Also, how can I get its peaks?

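import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# raw string so the backslash in the Windows path is not treated as an escape
df = pd.read_csv(r'D:\dataset.csv')
# gaussian_kde expects a 1-D sample; using the first column here is an assumption
data_column = df.iloc[:, 0].dropna().to_numpy()
pdf = scipy.stats.gaussian_kde(data_column)
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)

plt.plot(x, y, color='r')
# density=True replaces the removed normed=True argument
plt.hist(data_column, density=True)
plt.show(block=True)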

Recommended answer

I think you need to distinguish a non-parametric density estimate (the one implemented by scipy.stats.gaussian_kde) from a parametric one (the approach in the StackOverflow question you mention). To illustrate the difference between the two, try the following code.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf: Gaussian kernel density estimate evaluated on a grid
kde = stats.gaussian_kde(df['data'].to_numpy())
x = np.linspace(-20, 10, 200)
nparam_density = kde(x)

# parametric fit: assume a single normal distribution
loc_param, scale_param = stats.norm.fit(df['data'].to_numpy())
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df['data'], bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
plt.show()

From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x=x0 is represented by a bar (all of its probability mass is put at that observation and none elsewhere), whereas in non-parametric density estimation a bell-shaped curve (the Gaussian kernel) represents that point and spreads its mass over the neighbourhood. The result is a smooth density curve. This internal Gaussian kernel has nothing to do with any distributional assumption about the underlying data x. Its sole purpose is smoothing.
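
To make the "one bell curve per observation" picture concrete, here is a minimal sketch that rebuilds the estimate by hand and compares it with gaussian_kde. It assumes the default bandwidth rule (Scott's rule, exposed as the `factor` attribute) and one-dimensional data. The names `sample`, `grid`, `h` and `manual_density` are made up for this illustration, and the synthetic sample only mimics the mixture used above.

```python
import numpy as np
from scipy import stats

# Synthetic two-component sample (illustrative only, not the asker's data)
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-6, 3, 1700), rng.normal(4, 1.5, 300)])
grid = np.linspace(-20, 10, 200)

kde = stats.gaussian_kde(sample)

# For 1-D data the kernel standard deviation is the bandwidth factor
# times the sample standard deviation (ddof=1).
h = kde.factor * sample.std(ddof=1)

# Place one Gaussian bump on every observation and average them.
bumps = stats.norm.pdf(grid[:, None], loc=sample[None, :], scale=h)
manual_density = bumps.mean(axis=1)

# Largest difference between the hand-built curve and gaussian_kde
max_abs_diff = np.max(np.abs(manual_density - kde(grid)))
```

If the assumptions hold, `max_abs_diff` should be at the level of floating-point noise, which shows that the kernel only controls the smoothing and says nothing about the shape of the data.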

To get the mode of the non-parametric density, we need to do an exhaustive search, because the density is not guaranteed to be unimodal. As the example above shows, if a quasi-Newton optimization algorithm starts in [5, 10], it is very likely to end up at a local optimum rather than the global one.

# get mode: exhaustive search over the evaluation grid
x[np.argmax(nparam_density)]
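
The grid search above returns only the global mode. Since the question asks for the peaks in general, one possible approach is to run scipy.signal.find_peaks on the density evaluated over a grid. This is a sketch, not part of the original answer: the synthetic sample, the grid resolution and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats
from scipy.signal import find_peaks

# Synthetic two-component sample (illustrative only)
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-6, 3, 1700), rng.normal(4, 1.5, 300)])

# Evaluate the kernel density estimate on a fine grid
grid = np.linspace(-20, 10, 1000)
density = stats.gaussian_kde(sample)(grid)

# Indices of all local maxima of the sampled density curve
peak_idx, _ = find_peaks(density)
peak_locations = grid[peak_idx]
peak_heights = density[peak_idx]

# The global mode is the highest point on the grid
global_mode = grid[np.argmax(density)]
```

Noise in the estimate can produce small spurious bumps, especially in the tails. Passing a `prominence` threshold to find_peaks is one way to drop them, and a suitable value depends on the data.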
