Different integration results using Monte Carlo vs scipy.integrate.nquad
Question
The MWE below shows two ways of integrating the same 2D kernel density estimate, obtained for this data using the stats.gaussian_kde() function.
The integration is performed over all (x, y) below the threshold point (x1, y1), which defines the upper integration limits (the lower integration limits are -infinity; see the MWE).
- The int1 function uses a simple Monte Carlo approach.
- The int2 function uses the scipy.integrate.nquad function.
The issue is that int1 (i.e., the Monte Carlo method) gives systematically larger values for the integral than int2. I don't know why this happens.
Here's an example of the integral values obtained after 200 runs of int1 (blue histogram) versus the integral result given by int2 (red vertical line):
[Figure: histogram of the 200 int1 values (blue) with the int2 result marked as a red vertical line]
What is the origin of this difference in the resulting integral value?
MWE
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate


def int1(kernel, x1, y1):
    # Compute the point below which to integrate
    iso = kernel((x1, y1))
    # Sample KDE distribution
    sample = kernel.resample(size=50000)
    # Filter the sample
    insample = kernel(sample) < iso
    # The integral is equivalent to the probability of drawing a
    # point that gets through the filter
    integral = insample.sum() / float(insample.shape[0])
    return integral


def int2(kernel, x1, y1):
    def f_kde(x, y):
        return kernel((x, y))

    # 2D integration in: (-inf, x1), (-inf, y1).
    integral = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
    return integral


# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)
# Define the threshold point that determines the integration limits.
x1, y1 = 2.5, 1.5

# nquad returns a tuple: (integral value, estimated absolute error)
i2 = int2(kernel, x1, y1)
print(i2)

int1_vals = []
for _ in range(200):
    i = int1(kernel, x1, y1)
    int1_vals.append(i)
    print(i)
Addendum
Notice that this question originated from this answer. At first I didn't notice that the answer used the wrong integration limits, which explains why the results of int1 and int2 differ.
int1 integrates over the domain f(x,y)<f(x1,y1) (where f is the kernel density estimate), while int2 integrates over the domain (x,y)<(x1,y1).
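The two domains can be compared numerically. The following is a minimal sketch, assuming synthetic standard-normal data in place of data.dat and an arbitrary threshold of (0.5, 0.5); neither comes from the original question. It counts the fraction of resampled points that satisfy each criterion:

```python
import numpy as np
from scipy import stats

# Assumption: synthetic 2D data standing in for data.dat
rng = np.random.default_rng(0)
data = rng.normal(size=(2, 500))
kernel = stats.gaussian_kde(data)

# Assumption: illustrative threshold point
x1, y1 = 0.5, 0.5

# Draw points from the KDE itself (fixed seed for reproducibility)
sample = kernel.resample(size=5000, seed=1)

# Criterion used by int1: density lower than the density at (x1, y1)
density_mask = kernel(sample) < kernel((x1, y1))
# Criterion used by int2: coordinates below (x1, y1)
box_mask = (sample[0] < x1) & (sample[1] < y1)

frac_density = density_mask.mean()
frac_box = box_mask.mean()
print(frac_density, frac_box)
```

The two fractions estimate integrals over different regions, so there is no reason for them to agree.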
Recommended answer
You resample the distribution

    sample = kernel.resample(size=50000)

and then check whether the probability density at each sampled point is less than the density at the bound

    insample = kernel(sample) < iso
This is incorrect. Consider the bounds (0,100) and assume your data has mean u=(0,0) and cov=[[100,0],[0,100]]. Points (0,50) and (50,0) have the same probability density in this kernel, but only one of them lies within the bounds. Because the test compares densities rather than positions, it treats both points the same way, so the count does not correspond to the region inside the bounds.
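To make this concrete, here is a small check using the same Gaussian. The points (-10, 50) and (50, -10) are chosen here for illustration (they are not in the original answer) so that one lies strictly inside the bounds and the other strictly outside:

```python
import numpy as np
from scipy import stats

# Gaussian from the example above: mean (0, 0), covariance 100 * identity
dist = stats.multivariate_normal(mean=[0, 0], cov=[[100, 0], [0, 100]])
bound = np.array([0.0, 100.0])

# Assumption: illustrative points, mirror images of each other
p_in = np.array([-10.0, 50.0])
p_out = np.array([50.0, -10.0])

# Both points are at the same distance from the mean, so the density is equal
same_density = bool(np.isclose(dist.pdf(p_in), dist.pdf(p_out), rtol=1e-12, atol=0))

# Only the first point has both coordinates below the bound
in_box = [bool(np.all(p < bound)) for p in (p_in, p_out)]
print(same_density, in_box)
```

A filter based only on the density value cannot separate these two points, while a filter on the coordinates can.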
You should instead test whether each point in sample is inside the bounds, and then compute the probability as the fraction of points that pass. Something like
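def int1(kernel, x1, y1):
    # Sample KDE distribution
    sample = kernel.resample(size=100)
    # A point is included only if both coordinates are below (x1, y1)
    include = (sample < np.repeat([[x1], [y1]], sample.shape[1], axis=1)).all(axis=0)
    # Fraction of sampled points inside the bounds
    integral = include.sum() / float(sample.shape[1])
    return integral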
I tested this using the following code
import numpy as np
from scipy import stats

def measure(n):
    # Two independent standard-normal coordinates
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(size=n)
    return m1, m2

a = stats.gaussian_kde(np.vstack(measure(1000)))
print(int1(a, -10, -10))
print(int2(a, -10, -10))
print(int1(a, 0, 0))
print(int2(a, 0, 0))
which yields
0.0
(4.304674927251112e-232, 4.6980863813551415e-230)
0.26
(0.25897626178338407, 1.4536217446381293e-08)
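As an additional cross-check that is not part of the original answer, gaussian_kde also offers an integrate_box method that integrates the estimate over a rectangle. The sketch below assumes synthetic standard-normal data and uses a finite lower bound of -10 as a stand-in for -infinity, then compares the result with the fraction of resampled points inside the bounds:

```python
import numpy as np
from scipy import stats

# Assumption: synthetic data, similar in spirit to measure(1000) above
rng = np.random.default_rng(42)
kde = stats.gaussian_kde(rng.normal(size=(2, 1000)))

x1, y1 = 0.0, 0.0
# Assumption: -10 is far enough in the tails to stand in for -infinity
lower = [-10.0, -10.0]

# Integral of the KDE over the rectangle [lower, (x1, y1)]
box_integral = kde.integrate_box(lower, [x1, y1])

# Monte Carlo estimate: fraction of resampled points below (x1, y1)
sample = kde.resample(size=20000, seed=7)
mc_fraction = np.mean((sample[0] < x1) & (sample[1] < y1))
print(box_integral, mc_fraction)
```

For this data both numbers should land close to 0.25, the probability mass of one quadrant of a standard 2D normal.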
Monte Carlo integration should work like this
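- Sample N random values uniformly (not from your distribution) over some subset of the possible x/y values (below I bound it at 10 SDs from the mean).
- For each random value, compute kernel(rand_x, rand_y).
- Compute the sum and multiply by (volume)/N_samples.
In code: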
def mc_wo_sample(kernel, x1, y1, lboundx, lboundy):
    nsamples = 50000
    # Area of the rectangle being integrated over
    volume = (x1 - lboundx) * (y1 - lboundy)
    # generate uniform points in range
    xrand = np.random.rand(nsamples, 1) * (x1 - lboundx) + lboundx
    yrand = np.random.rand(nsamples, 1) * (y1 - lboundy) + lboundy
    # gaussian_kde expects points with shape (2, nsamples)
    randvals = np.hstack((xrand, yrand)).transpose()
    # print(randvals.shape)  # debug: (2, nsamples)
    return (volume * kernel(randvals).sum()) / nsamples
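The same estimator can be checked against a case with a known answer. The sketch below is an assumption-laden variant, not the answer's code: uniform_mc and std_normal_2d are hypothetical helper names, and a plain standard 2D normal density is used instead of a KDE so that the exact value (0.25 for the quadrant below (0, 0)) is available.

```python
import numpy as np
from scipy import stats

def uniform_mc(density, upper, lower, nsamples=200000, seed=0):
    """Uniform-sampling Monte Carlo integral of density over a box (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    volume = np.prod(upper - lower)
    # Uniform points in the box, shape (ndim, nsamples)
    points = rng.uniform(lower[:, None], upper[:, None], size=(lower.size, nsamples))
    # Mean density times box volume approximates the integral
    return volume * density(points).mean()

def std_normal_2d(points):
    # Density of two independent standard normals, evaluated column-wise
    return stats.norm.pdf(points[0]) * stats.norm.pdf(points[1])

estimate = uniform_mc(std_normal_2d, upper=[0.0, 0.0], lower=[-10.0, -10.0])
exact = stats.norm.cdf(0.0) ** 2  # 0.25, ignoring the negligible mass below -10
print(estimate, exact)
```

Because a gaussian_kde object is also callable on an array of shape (2, nsamples), it could be passed as density in place of std_normal_2d.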
Running the following
print(int1(a, -9, -9))
print(int2(a, -9, -9))
print(mc_wo_sample(a, -9, -9, -10, -10))
print(int1(a, 0, 0))
print(int2(a, 0, 0))
print(mc_wo_sample(a, 0, 0, -10, -10))
yields
0.0
(4.012958496109042e-70, 6.7211236076277e-71)
4.08538890986e-70
0.36
(0.37101621760650216, 1.4670898180664756e-08)
0.361614657674
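The remaining gap between the int1 value and the nquad value is of the size expected from sampling noise. For the counting estimator, each sampled point is either inside the bounds or not, so the standard error follows the binomial formula sqrt(p(1-p)/N). A rough sketch, where mc_standard_error is a hypothetical helper and p=0.37 is taken from the nquad result above:

```python
import math

def mc_standard_error(p, n):
    """Binomial standard error of a hit-fraction estimate (hypothetical helper)."""
    return math.sqrt(p * (1.0 - p) / n)

# size=100 as in the corrected int1, and size=50000 as in the original MWE
se_100 = mc_standard_error(0.37, 100)
se_50000 = mc_standard_error(0.37, 50000)
print(se_100, se_50000)
```

With only 100 samples the standard error is a few percent, which is consistent with 0.36 versus 0.371 above; increasing the sample size shrinks it in proportion to 1/sqrt(N).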