如何在python中计算2、2D kde图之间的公共体积/交点? [英] How to calculate the common volume/intersection between 2, 2D kde plots in python?

查看:74
本文介绍了如何在python中计算2、2D kde图之间的公共体积/交点?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有 2 组数据点:

import random
import pandas as pd
A = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})
B = pd.DataFrame({'x':[random.uniform(0, 1) for i in range(0,100)], 'y':[random.uniform(0, 1) for i in range(0,100)]})

对于这些数据集中的每一个,我都可以生成这样的联合图:

For each one of these dataset I can produce the jointplot like this:

import seaborn as sns
sns.jointplot(x=A["x"], y=A["y"], kind='kde')
sns.jointplot(x=B["x"], y=B["y"], kind='kde')

有没有办法计算公共区域"?在这两个联合地块之间?

Is there a way to calculate the "common area" between these 2 joint plots ?

通过公共区域,我的意思是,如果您在内部"放置一个联合地块另一个,交叉点的总面积是多少.因此,如果您将这两个联合地块想象成群山,并且将一座山放置在另一座山中,那么一座山落在另一座山中多少钱?

By common area, I mean, if you put one joint plot "inside" the other, what is the total area of intersection. So if you imagine these 2 joint plots as mountains, and you put one mountain inside the other, how much does one fall inside the other ?

编辑

为了让我的问题更清楚:

To make my question more clear:

import matplotlib.pyplot as plt
import scipy.stats as st

def plot_2d_kde(df):
    # Extract x and y
    x = df['x']
    y = df['y']
    # Define the borders
    deltaX = (max(x) - min(x))/10
    deltaY = (max(y) - min(y))/10
    xmin = min(x) - deltaX
    xmax = max(x) + deltaX
    ymin = min(y) - deltaY
    ymax = max(y) + deltaY

    # Create meshgrid
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]

    # We will fit a gaussian kernel using the scipy’s gaussian_kde method
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)

    fig = plt.figure(figsize=(13, 7))
    ax = plt.axes(projection='3d')
    surf = ax.plot_surface(xx, yy, f, rstride=1, cstride=1, cmap='coolwarm', edgecolor='none')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('PDF')
    ax.set_title('Surface plot of Gaussian 2D KDE')
    fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
    ax.view_init(60, 35)

我有兴趣找到这 2 个 kde 图的交集/公共体积(只是数量):

I am interested in finding the interection/common volume (just the number) of these 2 kde plots:

plot_2d_kde(A)
plot_2d_kde(B)

致谢:kde 图的代码来自 这里

Credits: The code for the kde plots is from here

推荐答案

以下代码比较了通过scipy的 dblquad 或通过获取网格上的平均值来计算交叉口的体积.

The following code compares calculating the volume of the intersection either via scipy's dblquad or via taking the average value over a grid.

备注:

  • 对于2D情况(仅具有100个采样点),似乎增量需要大于10%.下面的代码使用25%.增量为10%时, f1 f2 的计算值约为 0.90 ,而理论上它们应为 1.0 .增量为25%时,这些值在 0.994 左右.
  • 要以简单的方式估算体积,需要将平均值乘以面积(此处为(xmax-xmin)*(ymax-ymin)).此外,考虑的网格点越多,近似效果越好.下面的代码使用 1000x1000 网格点.
  • Scipy具有一些用于计算积分的特殊功能,例如 scipy.integrate.dblquad .这比简单"方法要慢得多,但要精确得多.默认精度不起作用,因此下面的代码大大降低了该精度.(dblquad 输出两个数字:近似积分和错误指示.为了只得到积分,代码中使用了 dblquad()[0].)
  • 同样的方法可用于更多维度.对于简单"方法,请创建尺寸更高的网格( xx,yy,zz = np.mgrid [xmin:xmax:100j,ymin:ymax:100j,zmin:zmax:100j] ).请注意,在每个维度上按1000细分都会创建一个太大的网格,无法使用.
  • 使用 scipy.integrate 时,dblquad 需要替换为 tplquad 用于 3 维或 nquad 用于N个维度.这可能还会很慢,因此需要进一步降低准确性.
  • For the 2D case (and with only 100 sample points), it seems the delta's need to be quite larger than 10%. The code below uses 25%. With a delta of 10%, the calculated values for f1 and f2 are about 0.90, while in theory they should be 1.0. With a delta of 25%, these values are around 0.994.
  • To approximate the volume the simple way, the average needs to be multiplied by the area (here (xmax - xmin)*(ymax - ymin)). Also, the more grid points are considered, the better the approximation. The code below uses 1000x1000 grid points.
  • Scipy has some special functions to calculate the integral, such as scipy.integrate.dblquad. This is much slower than the 'simple' method, but a bit more precise. The default precision didn't work, so the code below reduces that precision considerably. (dblquad outputs two numbers: the approximate integral and an indication of the error. To only get the integral, dblquad()[0] is used in the code.)
  • The same approach can be used for more dimensions. For the 'simple' method, create a more dimensional grid (xx, yy, zz = np.mgrid[xmin:xmax:100j, ymin:ymax:100j, zmin:zmax:100j]). Note that a subdivision by 1000 in each dimension would create a grid that's too large to work with.
  • When using scipy.integrate, dblquad needs to be replaced by tplquad for 3 dimensions or nquad for N dimensions. This probably will also be rather slow, so the accuracy needs to be reduced further.
import numpy as np
import pandas as pd
import scipy.stats as st
from scipy.integrate import dblquad

df1 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})
df2 = pd.DataFrame({'x':np.random.uniform(0, 1, 100), 'y':np.random.uniform(0, 1, 100)})

# Extract x and y
x1 = df1['x']
y1 = df1['y']
x2 = df2['x']
y2 = df2['y']
# Define the borders
deltaX = (np.max([x1, x2]) - np.min([x1, x2])) / 4
deltaY = (np.max([y1, y2]) - np.min([y1, y2])) / 4
xmin = np.min([x1, x2]) - deltaX
xmax = np.max([x1, x2]) + deltaX
ymin = np.min([y1, y2]) - deltaY
ymax = np.max([y1, y2]) + deltaY

# fit a gaussian kernel using scipy’s gaussian_kde method
kernel1 = st.gaussian_kde(np.vstack([x1, y1]))
kernel2 = st.gaussian_kde(np.vstack([x2, y2]))

print('volumes via scipy`s dblquad (volume):')
print('  volume_f1 =', dblquad(lambda y, x: kernel1((x, y)), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print('  volume_f2 =', dblquad(lambda y, x: kernel2((x, y)), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])
print('  volume_intersection =',
    dblquad(lambda y, x: np.minimum(kernel1((x, y)), kernel2((x, y))), xmin, xmax, ymin, ymax, epsabs=1e-4, epsrel=1e-4)[0])

或者,可以计算点网格上的平均值,然后将结果乘以网格面积.请注意, np.mgrid 比通过itertools创建列表要快得多.

Alternatively, one can calculate the mean value over a grid of points, and multiply the result by the area of the grid. Note that np.mgrid is much faster than creating a list via itertools.

# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:1000j, ymin:ymax:1000j]
positions = np.vstack([xx.ravel(), yy.ravel()])
f1 = np.reshape(kernel1(positions).T, xx.shape)
f2 = np.reshape(kernel2(positions).T, xx.shape)
intersection = np.minimum(f1, f2)
print('volumes via the mean value multiplied by the area:')
print('  volume_f1 =', np.sum(f1) / f1.size * ((xmax - xmin)*(ymax - ymin)))
print('  volume_f2 =', np.sum(f2) / f2.size * ((xmax - xmin)*(ymax - ymin)))
print('  volume_intersection =', np.sum(intersection) / intersection.size * ((xmax - xmin)*(ymax - ymin)))

示例输出:

volumes via scipy`s dblquad (volume):
  volume_f1 = 0.9946974276169385
  volume_f2 = 0.9928998852123891
  volume_intersection = 0.9046421634401607
volumes via the mean value multiplied by the area:
  volume_f1 = 0.9927873844924111
  volume_f2 = 0.9910132867915901
  volume_intersection = 0.9028999384136771

这篇关于如何在python中计算2、2D kde图之间的公共体积/交点?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆