1D Wasserstein distance in Python

Question

The formula below is a special case of the Wasserstein distance/optimal transport when the source and target distributions, x and y (also called marginal distributions) are 1D, that is, are vectors.
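The formula itself is not reproduced in the text. Assuming it is the standard one-dimensional p-Wasserstein distance, which is consistent with the integral from 0 to 1, the F^{-1} notation, and the exponent p used in the code of the answer, it would read:

```latex
W_p(u, v) = \left( \int_0^1 \left| F_u^{-1}(z) - F_v^{-1}(z) \right|^p \, dz \right)^{1/p}
```

Here the subscripts on F^{-1} are added only to tell the two marginals apart.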

where F^{-1} denotes the inverse cumulative distribution function (quantile function) of each of the marginals u and v, which are derived from real data called x and y, both generated from the normal distribution:

import numpy as np
from numpy.random import randn
import scipy.stats as ss

n = 100
x = randn(n)
y = randn(n)

How can the integral in the formula be coded in Python and SciPy? I'm guessing that x and y have to be converted to ranked marginals, which are non-negative and sum to 1, while SciPy's ppf could be used to calculate the inverses F^{-1}?

Recommended answer

Note that when n gets large we have that a sorted set of n samples approaches the inverse CDF sampled at 1/n, 2/n, ..., n/n. E.g.:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.plot(norm.ppf(np.linspace(0, 1, 1000)), label="invcdf")
plt.plot(np.sort(np.random.normal(size=1000)), label="sortsample")
plt.legend()
plt.show()

Also note that your integral from 0 to 1 can be approximated as a sum over 1/n, 2/n, ..., n/n.
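Putting the two observations together gives the following chain of approximations. This is a sketch rather than a rigorous derivation; the notation u_{(i)}, v_{(i)} for the i-th smallest sample of each data set is introduced here for illustration and assumes both samples have the same size n:

```latex
\int_0^1 \left| F_u^{-1}(z) - F_v^{-1}(z) \right|^p dz
\;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left| F_u^{-1}\!\left(\tfrac{i}{n}\right) - F_v^{-1}\!\left(\tfrac{i}{n}\right) \right|^p
\;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left| u_{(i)} - v_{(i)} \right|^p
```

The first step replaces the integral by a sum over an even grid on [0, 1]; the second replaces each quantile by the corresponding sorted sample.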

Thus we can simply answer your question:

def W(p, u, v):
    assert len(u) == len(v)
    return np.mean(np.abs(np.sort(u) - np.sort(v))**p)**(1/p)
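As a sanity check, for p = 1 and equal-size samples this sorted-sample formula should agree with `scipy.stats.wasserstein_distance`, which computes the first Wasserstein distance between two empirical 1D distributions. The sketch below restates the function under the hypothetical name `w_sorted` so that it runs on its own; the seed and the 0.5 shift are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance


def w_sorted(p, u, v):
    """Empirical p-Wasserstein distance for two equal-size 1D samples."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    assert len(u) == len(v)
    return np.mean(np.abs(u - v) ** p) ** (1 / p)


rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
x = rng.standard_normal(100)
y = rng.standard_normal(100) + 0.5  # second sample, shifted by 0.5

ours = w_sorted(1, x, y)
reference = wasserstein_distance(x, y)  # SciPy's 1-Wasserstein distance
print(ours, reference)

# Shifting a sample by a constant c moves it a distance |c| for any p.
shifted = w_sorted(2, x, x + 3.0)
print(shifted)
```

The two printed values for p = 1 should match up to floating-point error. SciPy's function only covers p = 1, so for other values of p the sorted-sample formula is still needed.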

Note that if len(u) != len(v) you can still apply the method with linear interpolation:

def W(p, u, v):
    u = np.sort(u)
    v = np.sort(v)
    if len(u) != len(v):
        if len(u) > len(v): u, v = v, u
        us = np.linspace(0, 1, len(u))
        vs = np.linspace(0, 1, len(v))
        # Resample the shorter sorted sample onto the longer sample's grid.
        u = np.interp(vs, us, u)
    return np.mean(np.abs(u - v)**p)**(1/p)
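A small worked example of the unequal-length case, using `np.interp(x, xp, fp)`, which evaluates the piecewise-linear interpolant through the points `(xp, fp)` at the positions `x`. The function is restated under the hypothetical name `w_interp` so the block is self-contained, and the tiny input arrays are made up for illustration:

```python
import numpy as np


def w_interp(p, u, v):
    """Empirical p-Wasserstein distance; samples may differ in size."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    if len(u) != len(v):
        if len(u) > len(v):
            u, v = v, u  # make u the shorter sample
        us = np.linspace(0, 1, len(u))  # quantile grid of the shorter sample
        vs = np.linspace(0, 1, len(v))  # quantile grid of the longer sample
        u = np.interp(vs, us, u)  # resample shorter sample onto longer grid
    return np.mean(np.abs(u - v) ** p) ** (1 / p)


shorter = [0.0, 1.0]
longer = [0.0, 0.5, 1.0]

# [0, 1] interpolated onto three points gives [0, 0.5, 1]: distance 0.
same = w_interp(2, shorter, longer)

# Every interpolated point is exactly 2 away from its counterpart.
shifted = w_interp(2, shorter, [2.0, 2.5, 3.0])

print(same, shifted)
```

Because the shorter sample is always the one that gets resampled, the result does not depend on the order of the arguments.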


An alternative method if you have prior information about the sort of distribution of your data, but not its parameters, is to find the best fitting distribution on your data (e.g. with scipy.stats.norm.fit) for both u and v and then do the integral with the desired precision. E.g.:

from scipy.stats import norm as gauss
def W_gauss(p, u, v, num_steps):
    ud = gauss(*gauss.fit(u))
    vd = gauss(*gauss.fit(v))
    z = np.linspace(0, 1, num_steps, endpoint=False) + 1/(2*num_steps)
    return np.mean(np.abs(ud.ppf(z) - vd.ppf(z))**p)**(1/p)
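One way to check the midpoint-rule integration is against the closed form for two normal distributions: for p = 2, the distance between N(m1, s1^2) and N(m2, s2^2) is sqrt((m1 - m2)^2 + (s1 - s2)^2). The sketch below skips the fitting step and passes the parameters directly so that the result is deterministic; the function names and parameter values are hypothetical, chosen for illustration:

```python
import numpy as np
from scipy.stats import norm


def w_gauss_numeric(p, m1, s1, m2, s2, num_steps):
    """Midpoint-rule approximation of the quantile integral for two normals."""
    # Midpoints of num_steps equal cells on [0, 1]; avoids z = 0 and z = 1,
    # where the normal quantile function is infinite.
    z = (np.arange(num_steps) + 0.5) / num_steps
    q1 = norm.ppf(z, loc=m1, scale=s1)
    q2 = norm.ppf(z, loc=m2, scale=s2)
    return np.mean(np.abs(q1 - q2) ** p) ** (1 / p)


def w2_gauss_closed_form(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between two 1D normals."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


numeric = w_gauss_numeric(2, 0.0, 1.0, 1.0, 2.0, 10_000)
exact = w2_gauss_closed_form(0.0, 1.0, 1.0, 2.0)
print(numeric, exact)
```

The numeric value slightly underestimates the exact one because the midpoint grid truncates the tails; increasing `num_steps` narrows the gap.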
