Relation between 2D KDE bandwidth in sklearn vs bandwidth in scipy


Problem description


I'm attempting to compare the performance of sklearn.neighbors.KernelDensity with scipy.stats.gaussian_kde for a two-dimensional array.

From this article I see that the bandwidths (bw) are treated differently by each function. The article gives a recipe for setting the bw in scipy so that it is equivalent to the one used in sklearn. Basically, it divides the bw by the sample standard deviation. The result is this:

# For sklearn
bw = 0.15

# For scipy
bw = 0.15/x.std(ddof=1)

where x is the sample array I'm using to obtain the KDE. This works just fine in 1D, but I can't make it work in 2D.
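As a quick sanity check (a minimal 1D sketch, not part of the original question; the variable names are illustrative), the recipe can be verified by evaluating both estimators at the same point. Since scipy's `bw_method` is a multiplicative factor applied on top of the sample standard deviation, dividing by that standard deviation makes the effective kernel width equal to sklearn's absolute `bandwidth`:

```python
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
x = rng.normal(0., 1., size=1000)

bw = 0.15
# sklearn: bandwidth is the absolute std of the Gaussian kernel.
kde_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x[:, None])
d_sk = np.exp(kde_sk.score_samples([[0.5]]))[0]

# scipy: bw_method is a factor multiplied by the sample std,
# so divide by the std to recover the same absolute bandwidth.
kde_sp = stats.gaussian_kde(x, bw_method=bw / x.std(ddof=1))
d_sp = kde_sp(0.5)[0]

print(d_sk, d_sp)  # agree to floating-point precision in 1D
```

In 1D the match is exact because scipy's kernel variance is `factor**2 * var(ddof=1)`, which collapses to `bw**2` with this choice of factor.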

Here's a MWE of what I got:

import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

# Generate random data.
n = 1000
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])

# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5

# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1,y1))
print('iso1 =', iso1[0])

# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(values.T)
# Get KDE value for the point.
iso2 = kernel_sk.score_samples([[x1, y1]])
print('iso2 =', np.exp(iso2[0]))

(iso2 is exponentiated since sklearn returns log-density values.)

The results I get for iso1 and iso2 are different, and I'm lost as to how I should adjust the bandwidth (in either function) to make them equal (as they should be).


Addendum

I was advised over at sklearn chat (by ep) that I should scale the values in (x,y) before calculating the kernel with scipy in order to obtain comparable results with sklearn.

So this is what I did:

# Scale values.
x_val_sca = np.asarray(values[0])/np.asarray(values).std(axis=1)[0]
y_val_sca = np.asarray(values[1])/np.asarray(values).std(axis=1)[1]
values = [x_val_sca, y_val_sca]
kernel = stats.gaussian_kde(values, bw_method=bw_value)

i.e., I scaled both dimensions before obtaining the kernel with scipy, while leaving the line that obtains the kernel in sklearn untouched.

This gave better results, but there are still differences between the kernels obtained:

where the red dot marks the (x1, y1) point in the code. As can be seen, there are still differences in the shapes of the density estimates, albeit very small ones. Perhaps this is the best that can be achieved?

Solution

A couple of years later I tried this again and got it to work with no re-scaling of the data needed. The bandwidth values do need some scaling, though:

# For sklearn
bw = 0.15

# For scipy
bw = 0.15/x.std(ddof=1)

Evaluating both KDEs at the same point does not give exactly equal results. For example, here's an evaluation at the (x1, y1) point:

iso1 =  0.00984751705005  # Scipy
iso2 =  0.00989788224787  # Sklearn

but I guess it's close enough.
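The small residual difference is expected: scipy's gaussian_kde shapes its kernel with the full sample covariance matrix of the data (scaled by the squared factor), whereas sklearn's KernelDensity uses a single isotropic bandwidth for all dimensions. A minimal 2D sketch of the comparison (variable names are illustrative, not from the original code):

```python
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
# Two dimensions with roughly equal spread, as in the MWE.
vals = np.vstack([rng.normal(0., 1., 1000), rng.normal(0., 1., 1000)])

bw = 0.15
# scipy: bw_method is a factor applied on top of the data covariance.
kde_sp = stats.gaussian_kde(vals, bw_method=bw / vals.std(ddof=1))
# sklearn: bandwidth is absolute and isotropic.
kde_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(vals.T)

pt = np.array([[0.5, 0.5]])
d_sp = kde_sp(pt.T)[0]
d_sk = np.exp(kde_sk.score_samples(pt))[0]
# Close but not identical: scipy's kernel follows the (slightly
# anisotropic, slightly correlated) sample covariance, while
# sklearn's kernel is perfectly isotropic.
print(d_sp, d_sk)
```

When the per-axis spreads are equal, the sample covariance is nearly isotropic and the two values agree closely, which is why the recipe works well in this case.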

Here's the MWE for the 2D case, and the outputs, which, as far as I can see, look almost exactly the same:

import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

# Generate random data.
n = 1000
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
ext_range = [xmin, xmax, ymin, ymax]
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])

# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# Bandwidth value.
bw = 0.15

# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
# **Bandwidth needs to be scaled to match Sklearn results**
kernel = stats.gaussian_kde(
    values, bw_method=bw/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print('iso1 =', iso1[0])

# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(values.T)
# Get KDE value for the point. Use exponential since sklearn returns the
# log values
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]]))
print('iso2 =', iso2[0])


# Plot
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2)

# Scipy
plt.subplot(gs[0])
plt.title("Scipy", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos = kernel(positions)
kde = np.reshape(k_pos.T, x.shape)
plt.imshow(np.rot90(kde), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6)

# Sklearn
plt.subplot(gs[1])
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos2 = np.exp(kernel_sk.score_samples(positions.T))
kde2 = np.reshape(k_pos2.T, x.shape)
plt.imshow(np.rot90(kde2), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6)

fig.tight_layout()
plt.savefig('KDEs', dpi=300, bbox_inches='tight')
