Computation of Kullback-Leibler (KL) distance between text documents using numpy


Problem description

My goal is to compute the KL distance between the following text documents:

1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY

I first of all vectorised the documents in order to easily apply numpy:

1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]

I then applied the following code for computing KL distance between the texts:

import numpy as np

v = [[1, 1, 1, 1, 1, 1, 1], [1, 2, 1, 1, 1, 2, 1], [1, 1, 1, 1, 1, 1, 1]]
c = v[0]

def kl(p, q):
    p = np.asarray(p, dtype=float)  # np.float is removed in recent numpy; use float
    q = np.asarray(q, dtype=float)
    return np.sum(np.where(p != 0, (p - q) * np.log10(p / q), 0))

for x in v:
    KL = kl(x, c)
    print(KL)

Here is the result of the above code: [0.0, 0.602059991328, 0.0]. Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.

Does anyone have an idea of what I'm doing wrong with regard to KL? Many thanks for your suggestions.

Solution

Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance - they are, according to the following documentation, the same) is designed to measure the difference between probability distributions. This means basically that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
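As a minimal sketch of that point (the helper name and the normalization of the question's count vectors are my own illustration, not code from the original answer), one might do:

import numpy as np

def kl_divergence(p, q):
    # Turn raw term counts into probability distributions that sum to 1.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # Textbook KL divergence D(P || Q) = sum(p * log(p / q)), skipping zero entries of p.
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))

v = [[1, 1, 1, 1, 1, 1, 1], [1, 2, 1, 1, 1, 2, 1], [1, 1, 1, 1, 1, 1, 1]]
print(kl_divergence(v[1], v[0]))  # divergence of document 2's distribution from document 1's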

Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":

scipy.stats.entropy(pk, qk=None, base=None)

http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html

From the docs:

If qk is not None, then compute a relative entropy (also known as Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk * log(pk / qk), axis=0).

A further bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass - ie, how they are constructed from the data).
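As a rough usage sketch with the question's count vectors (the variable names are mine), scipy.stats.entropy takes care of the normalization itself:

import numpy as np
from scipy.stats import entropy

doc1 = np.array([1, 1, 1, 1, 1, 1, 1])
doc2 = np.array([1, 2, 1, 1, 1, 2, 1])

# entropy(pk, qk) normalizes both vectors and returns sum(pk * log(pk / qk)),
# i.e. the KL divergence, using the natural log unless `base` is given.
print(entropy(doc1, doc2))
print(entropy(doc2, doc1))  # note that KL divergence is not symmetric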

Hope this helps - and at least a library provides it, so you don't have to code your own.
