Python's implementation of Mutual Information

Problem description

I am having some issues implementing the mutual information function that Python's machine learning libraries provide, in particular: sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)

(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html)

I am trying to implement the example I find in the Stanford NLP tutorial site:

The site is found here: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2

The problem is that I keep getting different results, and I haven't figured out the reason yet.

I get the concept of mutual information and feature selection; I just don't understand how it is implemented in Python. What I do is provide the mutual_info_score method with two arrays based on the NLP site example, but it outputs different results. Another interesting fact is that however I play around and change the numbers in those arrays, I am most likely to get the same result. Am I supposed to use another data structure specific to Python, or what is the issue behind this? If anyone has used this function successfully in the past it would be a great help to me; thank you for your time.
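For concreteness, a call of the kind described might look like the following sketch. The label vectors here are my own reconstruction from the counts in the Stanford example (N11 = 49, N10 = 27652, N01 = 141, N00 = 774106 for the term "export" and the class "poultry"), not the asker's actual arrays:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Rebuild one label per document from the Stanford contingency counts,
# in the order N11, N10, N01, N00.
counts = [49, 27652, 141, 774106]
term_labels = np.repeat([1, 1, 0, 0], counts)   # term "export" present?
class_labels = np.repeat([1, 0, 1, 0], counts)  # class "poultry"?

print(mutual_info_score(term_labels, class_labels))
```

Note that the tutorial's expected value (about 0.0001105) is stated in bits, which is one place a discrepancy can creep in.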

Recommended answer

I encountered the same issue today. After a few trials I found the real reason: if you strictly follow the NLP tutorial you use log2, but sklearn.metrics.mutual_info_score uses the natural logarithm (base e, Euler's number). I didn't find this detail in the sklearn documentation...

I verified this by:

import numpy as np

def computeMI(x, y):
    # Mutual information in bits (log base 2) between two label vectors.
    x = np.asarray(x)
    y = np.asarray(y)
    sum_mi = 0.0
    x_value_list = np.unique(x)
    y_value_list = np.unique(y)
    Px = np.array([np.sum(x == xval) / len(x) for xval in x_value_list])  # P(x)
    Py = np.array([np.sum(y == yval) / len(y) for yval in y_value_list])  # P(y)
    for i in range(len(x_value_list)):  # range, not Python 2's xrange
        if Px[i] == 0.:
            continue
        sy = y[x == x_value_list[i]]
        if len(sy) == 0:
            continue
        # joint probabilities P(x, y) for this x value across all y values
        pxy = np.array([np.sum(sy == yval) / len(y) for yval in y_value_list])
        t = pxy[Py > 0.] / Py[Py > 0.] / Px[i]  # P(x,y) / (P(x) * P(y))
        sum_mi += np.sum(pxy[t > 0] * np.log2(t[t > 0]))  # sum P(x,y) * log2(P(x,y)/(P(x)*P(y)))
    return sum_mi

If you change np.log2 to np.log, I think it will give you the same answer as sklearn. The only difference is that where this method returns 0, sklearn may return a number very close to 0. (And of course, use sklearn if you don't care about the log base; my piece of code is just a demo and gives poor performance...)
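To illustrate the comparison, here is a condensed natural-log variant of the same computation, checked against sklearn on random labels (mi_natural_log is my own helper name, not part of sklearn):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_natural_log(x, y):
    # Same sum as computeMI above, but with np.log in place of np.log2 (nats).
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xval in np.unique(x):
        px = np.mean(x == xval)
        for yval in np.unique(y):
            pxy = np.mean((x == xval) & (y == yval))  # joint P(x, y)
            py = np.mean(y == yval)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.RandomState(0)
x = rng.randint(0, 3, size=1000)
y = rng.randint(0, 4, size=1000)
print(abs(mi_natural_log(x, y) - mutual_info_score(x, y)))  # tiny, on the order of float error
```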

FYI: 1) sklearn.metrics.mutual_info_score accepts plain lists as well as np.array; 2) sklearn.metrics.cluster.entropy also uses the natural log, not log2.
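Both points can be checked directly; this sketch assumes entropy is importable from sklearn.metrics.cluster as named above:

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import entropy

# 1) Plain Python lists are accepted, not just np.array.
#    These two labelings are independent, so MI is 0.
mi = mutual_info_score([0, 0, 1, 1], [0, 1, 0, 1])

# 2) entropy uses the natural log: two equiprobable labels
#    give ln(2) ~ 0.693 nats, not 1 bit.
h = entropy([0, 0, 1, 1])
print(mi, h)
```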

As for the "same result", I'm not sure what you really mean. In general, the values in the vectors don't really matter; it is the "distribution" of the values that matters. You care about P(X=x), P(Y=y) and P(X=x,Y=y), not the values x and y themselves.
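A quick way to see this label-invariance (the example values are my own):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Two perfectly dependent binary vectors...
a = mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
# ...and the same *distribution* under completely different label values:
b = mutual_info_score([5, 5, 9, 9], [7, 7, 2, 2])

print(a, b)  # identical: only the distribution matters, not the values
```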
