Python: Overlap between two functions (PDF of kde and normal)


Problem description

Short summary: I want to figure out how to calculate the overlap between two functions. One is a Gaussian distribution, the other is a kernel density estimate based on data. I then want to make a small algorithm that chooses the mean and variance of the Gaussian so as to maximize the overlap.

First, the required imports:

import math

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.integrate   # for cumtrapz below
import scipy.stats       # for norm.pdf below
from scipy.stats import gaussian_kde

I have some data which is approximately normal (with a somewhat heavy right tail). I calculate the kernel density, CDF and PDF of this data like this (in this example the data is simply drawn from a normal distribution, as I can't supply the real data):

def survivalFunction():

    data = np.random.normal(7, 1, 100)  # stand-in for the real (roughly normal) data

    # Let seaborn estimate the kernel density, then pull the x/y values off the plotted line
    p = sns.kdeplot(data, shade=False, lw=3)
    x, y = p.get_lines()[0].get_data()

    # Cumulative integral of the density gives the CDF
    cdf = scipy.integrate.cumtrapz(y, x, initial=0)

    plt.hist(data, 50, density=True, facecolor='b', alpha=0.3)
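
As an aside, the gaussian_kde import above is never actually used. If you'd rather not pull the curve out of seaborn's plot object, a minimal sketch of the same density/CDF estimate using scipy directly could look like this (my sketch, reusing the imports above; not the original code):

def survivalFunctionScipy():
    # Same idea without going through seaborn's plot object
    data = np.random.normal(7, 1, 100)

    kde = gaussian_kde(data)
    x = np.linspace(data.min() - 3, data.max() + 3, 200)
    y = kde(x)                                        # PDF of the kernel density estimate
    cdf = scipy.integrate.cumtrapz(y, x, initial=0)   # CDF by cumulative integration

    plt.plot(x, y, lw=3)
    plt.hist(data, 50, density=True, facecolor='b', alpha=0.3)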

Then I have another function, which is just a simple Gaussian:

def surpriseFunction(mu, variance):

    hStates = np.linspace(0, 20, 100)
    sigma = math.sqrt(variance)

    # Plot a normal PDF with the given mean and variance over the same range as the data
    plt.plot(hStates, scipy.stats.norm.pdf(hStates, mu, sigma))

Calling the functions

surpriseFunction(5,1)
survivalFunction()

gives this plot:

As you might have noticed, exchanging different values of mu moves the normal around so that it overlaps more or less with the kernel estimate. Now, my question is twofold:

1) How do I calculate the overlap between the two functions?

2) How would I make a small algorithm that chooses a mean and variance for the Gaussian so as to maximize this overlap?

Answer

Okay, so I did a fairly major reshuffle; I think it separates out the major parts and will make it easy to modularise into various functions. The original code from my previous answer is here.

Here's the new stuff; hopefully it's pretty self-explanatory.

# Setup our various global variables
population_mean = 7
population_std_dev = 1
samples = 100
histogram_bins = 50

# And setup our figure.
from matplotlib import pyplot
fig = pyplot.figure()
ax = fig.add_subplot(1,1,1)


from numpy.random import normal  
hist_data = normal(population_mean, population_std_dev, samples)
ax.hist(hist_data, bins=histogram_bins, density=True, color="blue", alpha=0.3)


from statsmodels.nonparametric.kde import KDEUnivariate
kde = KDEUnivariate(hist_data)
kde.fit()

# kde.support and kde.density hold the x and y values of the KDE fit.
ax.plot(kde.support, kde.density, color="red", lw=4)


#Gaussian function - though you can replace this with something of your choosing later.
from numpy import sqrt, exp, pi
r2pi = sqrt(2*pi)
def gaussian(x, mu, sigma):
  return exp(-0.5 * ( (x-mu) / sigma)**2) / (sigma * r2pi)

# Interpolate the KDE samples to produce a callable function.
# bounds_error=False is needed for fill_value=0 to take effect outside the support.
from scipy.interpolate import interp1d
kde_func = interp1d(kde.support, kde.density, kind="cubic",
                    bounds_error=False, fill_value=0)
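
A quick check that the interpolation behaves as intended (my addition, not part of the original answer):

# The interpolant should reproduce the KDE at its own support points...
print(kde_func(kde.support[:3]), kde.density[:3])
# ...and return 0 outside the support (bounds_error=False, fill_value=0)
print(kde_func(kde.support.max() + 100.0))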

What you want to do is just standard curve fitting - there are numerous ways to do it, and you say you want to fit the curve by maximizing the overlap of the two functions (why?). The curve_fit scipy routine is a least-squares fit, which tries to minimize the difference between the two functions - the distinction is subtle: maximizing the overlap does not punish the fitting function for being larger than the data, whereas curve_fit does.
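
To make the distinction concrete (my notation, not from the original answer), writing \hat f for the interpolated KDE, the two objectives optimised below are:

\text{overlap}(\mu,\sigma) = \int \min\bigl(\hat f(x),\ \mathcal N(x;\mu,\sigma)\bigr)\,dx \qquad \text{(maximised via fmin\_powell)}

\text{SSE}(\mu,\sigma) = \sum_i \bigl(\hat f(x_i) - \mathcal N(x_i;\mu,\sigma)\bigr)^2 \qquad \text{(minimised by curve\_fit)}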

我已经包含了使用这两种技术的解决方案,并对它们进行了分析:

I've included solutions using both techniques, as well as profiled them:

# We need to *maximise* the overlap integral
from scipy.integrate import quad as integrate
def overlap(func1, func2, limits, func1_args=[], func2_args=[]):

  def product_func(x):
    # Pointwise minimum of the two densities -- integrating this gives the overlap area
    return min(func1(x, *func1_args), func2(x, *func2_args))

  return integrate(product_func, *limits)[0]  # quad returns (value, abs_error); keep just the value

limits = hist_data.min(), hist_data.max()
def gaussian_overlap(args):
  mu, sigma = args
  # Negated because the optimiser below *minimises*: minimising -overlap maximises overlap
  return -overlap(kde_func, gaussian, limits, func2_args=[mu, sigma])
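
As a quick sanity check (my addition), you can evaluate the metric at the starting guess; a larger value means more overlap, and a perfect match between two unit-area densities would approach 1:

# Overlap at the initial guess, negated back to a positive area
print(-gaussian_overlap((population_mean, population_std_dev)))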

And now the two different methods. First, the overlap metric:

import cProfile, pstats, io
pr1 = cProfile.Profile()
pr1.enable()

from scipy.optimize import fmin_powell as minimize
mu_overlap_fit, sigma_overlap_fit = minimize(gaussian_overlap, (population_mean, population_std_dev))

pr1.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr1, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())



   3122462 function calls in 6.298 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.298    6.298 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:2120(fmin_powell)
        1    0.000    0.000    6.298    6.298 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:2237(_minimize_powell)
       57    0.000    0.000    6.296    0.110 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:279(function_wrapper)
       57    0.000    0.000    6.296    0.110 C:\Users\Will\Documents\Python_scripts\hist_fit.py:47(gaussian_overlap)
       57    0.000    0.000    6.296    0.110 C:\Users\Will\Documents\Python_scripts\hist_fit.py:39(overlap)
       57    0.000    0.000    6.296    0.110 C:\Python27\lib\site-packages\scipy\integrate\quadpack.py:42(quad)
       57    0.000    0.000    6.295    0.110 C:\Python27\lib\site-packages\scipy\integrate\quadpack.py:327(_quad)
       57    0.069    0.001    6.295    0.110 {scipy.integrate._quadpack._qagse}
    66423    0.154    0.000    6.226    0.000 C:\Users\Will\Documents\Python_scripts\hist_fit.py:41(product_func)
        4    0.000    0.000    6.167    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:2107(_linesearch_powell)
        4    0.000    0.000    6.166    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:1830(brent)
        4    0.000    0.000    6.166    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:1887(_minimize_scalar_brent)
        4    0.001    0.000    6.166    1.542 C:\Python27\lib\site-packages\scipy\optimize\optimize.py:1717(optimize)

And the scipy curve_fit method:

pr2 = cProfile.Profile()
pr2.enable()

from scipy.optimize import curve_fit
(mu_curve_fit, sigma_curve_fit), _ = curve_fit(gaussian, kde.support, kde.density, p0=(population_mean, population_std_dev))

pr2.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr2, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())




   122 function calls in 0.001 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:452(curve_fit)
        1    0.000    0.000    0.001    0.001 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:256(leastsq)
        1    0.000    0.000    0.001    0.001 {scipy.optimize._minpack._lmdif}
       19    0.000    0.000    0.001    0.000 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:444(_general_function)
       19    0.000    0.000    0.000    0.000 C:\Users\Will\Documents\Python_scripts\hist_fit.py:29(gaussian)
        1    0.000    0.000    0.000    0.000 C:\Python27\lib\site-packages\scipy\linalg\basic.py:314(inv)
        1    0.000    0.000    0.000    0.000 C:\Python27\lib\site-packages\scipy\optimize\minpack.py:18(_check_func)

You can see the curve_fit method is much faster, and the results:

from numpy import linspace

# Plot each fitted Gaussian over +/- 6 sigma around its fitted mean
xs = linspace(-1, 1, num=1000) * sigma_overlap_fit * 6 + mu_overlap_fit
ax.plot(xs, gaussian(xs, mu_overlap_fit, sigma_overlap_fit), color="orange", lw=2)

xs = linspace(-1, 1, num=1000) * sigma_curve_fit * 6 + mu_curve_fit
ax.plot(xs, gaussian(xs, mu_curve_fit, sigma_curve_fit), color="purple", lw=2)

pyplot.show()

are very similar. I would recommend curve_fit. In this case it is 6000x faster. The difference between the two fits is a little bigger when the underlying data is more complex, but not by much, and you still get a huge speed-up. Here's an example for 6 uniformly distributed normal distributions being fit:

Using curve_fit
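
The answer doesn't show how that test data was built, but a mixture like it could be generated along these lines (my sketch; the range, spreads and sample counts are made-up values for illustration):

from numpy import concatenate, sort
from numpy.random import normal, uniform

centres = sort(uniform(0, 60, size=6))                  # 6 means drawn uniformly
mixture_data = concatenate([normal(c, 1.0, 100) for c in centres])
# Re-running the fitting code above with hist_data = mixture_data gives that kind of figure.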
