KL-Divergence of two GMMs

Question

I have two GMMs that I used to fit two different sets of data in the same space, and I would like to calculate the KL-divergence between them.

Currently I am using the GMMs defined in sklearn (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html) and the SciPy implementation of KL-divergence (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html)

How would I go about doing this? Do I want to just create tons of random points, get their probabilities on each of the two models (call them P and Q) and then use those probabilities as my input? Or is there some more canonical way to do this within the SciPy/SKLearn environment?

Answer

There's no closed form for the KL divergence between GMMs. You can easily do Monte Carlo, though. Recall that KL(p||q) = \int p(x) log(p(x) / q(x)) dx = E_p[log(p(x) / q(x))]. So:

def gmm_kl(gmm_p, gmm_q, n_samples=10**5):
    # Monte Carlo estimate of KL(p||q): sample from p, average log p(x) - log q(x).
    # Uses the sklearn.mixture.GMM API linked in the question, whose score_samples
    # returns (log-densities, responsibilities).
    X = gmm_p.sample(n_samples)
    log_p_X, _ = gmm_p.score_samples(X)
    log_q_X, _ = gmm_q.score_samples(X)
    return log_p_X.mean() - log_q_X.mean()

(mean(log(p(x) / q(x))) = mean(log(p(x)) - log(q(x))) = mean(log(p(x))) - mean(log(q(x))) is computationally somewhat cheaper.)

You don't want to use scipy.stats.entropy; that's for discrete distributions.
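
For example, here is a minimal end-to-end usage sketch, assuming the sklearn.mixture.GMM class linked in the question; the synthetic data, component counts, and variable names are illustrative only:

import numpy as np
from sklearn.mixture import GMM  # the (since deprecated) class linked in the question

# Two illustrative data sets living in the same 2-D space.
rng = np.random.RandomState(0)
data_p = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
data_q = rng.normal(loc=1.0, scale=1.5, size=(1000, 2))

gmm_p = GMM(n_components=3, covariance_type='full').fit(data_p)
gmm_q = GMM(n_components=3, covariance_type='full').fit(data_q)

print(gmm_kl(gmm_p, gmm_q))  # Monte Carlo estimate of KL(p||q), in nats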

If you want the symmetrized and smoothed Jensen-Shannon divergence (KL(p||(p+q)/2) + KL(q||(p+q)/2)) / 2 instead, it's pretty similar:

import numpy as np

def gmm_js(gmm_p, gmm_q, n_samples=10**5):
    # Monte Carlo estimate of JS(p, q) = (KL(p||m) + KL(q||m)) / 2 with m = (p+q)/2.
    X = gmm_p.sample(n_samples)
    log_p_X, _ = gmm_p.score_samples(X)
    log_q_X, _ = gmm_q.score_samples(X)
    # log(p(x) + q(x)) = log(2 * m(x)); the log(2) is subtracted in the return line.
    log_mix_X = np.logaddexp(log_p_X, log_q_X)

    Y = gmm_q.sample(n_samples)
    log_p_Y, _ = gmm_p.score_samples(Y)
    log_q_Y, _ = gmm_q.score_samples(Y)
    log_mix_Y = np.logaddexp(log_p_Y, log_q_Y)

    # KL(p||m) is estimated from X ~ p, KL(q||m) from Y ~ q, then the two are averaged.
    return (log_p_X.mean() - (log_mix_X.mean() - np.log(2))
            + log_q_Y.mean() - (log_mix_Y.mean() - np.log(2))) / 2

(log_mix_X/log_mix_Y are actually the log of twice the mixture densities; pulling that out of the mean operation saves some flops.)
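
As a side note, not part of the original answer: in newer scikit-learn releases sklearn.mixture.GMM was replaced by GaussianMixture, whose sample returns an (X, labels) tuple and whose score_samples returns the per-sample log-densities directly, so the unpacking changes slightly. A sketch of the same KL estimator under that API (the name gmm_kl_gm and the assumption that gmm_p/gmm_q are fitted GaussianMixture models are mine):

from sklearn.mixture import GaussianMixture  # replacement for the deprecated GMM class

def gmm_kl_gm(gmm_p, gmm_q, n_samples=10**5):
    # gmm_p and gmm_q are assumed to be fitted GaussianMixture instances.
    # GaussianMixture.sample returns (samples, component_labels).
    X, _ = gmm_p.sample(n_samples)
    # GaussianMixture.score_samples returns log-densities directly (no responsibilities).
    log_p_X = gmm_p.score_samples(X)
    log_q_X = gmm_q.score_samples(X)
    return log_p_X.mean() - log_q_X.mean()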
