Can the value of information gain be negative?


Question

Is there a chance for the value of information gain to be negative? It is calculated according to the formula in the following paper. I cannot write out the formula here, because it includes some hard-to-type notation.

http://citeseerx.ist.psu.edu

Thanks!

Answer

IG(Y|X) = H(Y) - H(Y|X) >= 0, since H(Y) >= H(Y|X). The worst case is that X and Y are independent, in which case H(Y|X) = H(Y). (This difference is exactly the mutual information I(X;Y), which is never negative.)

Another way to think about it: by observing the random variable X take some value, we either gain no information or gain some information about Y; you never lose any.
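To see this numerically, here is a minimal sketch (my own, not from the paper) that computes H(Y), H(Y|X), and their difference for a made-up joint distribution of two binary variables; the 2x2 table and the helper name are illustrative assumptions only:

    import math

    def entropy(probs):
        # Shannon entropy in bits; terms with p == 0 contribute nothing.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Made-up joint distribution P(X=x, Y=y): rows are x, columns are y.
    joint = [[0.30, 0.10],
             [0.15, 0.45]]

    p_y = [sum(row[y] for row in joint) for y in range(2)]  # marginal P(Y)
    h_y = entropy(p_y)                                      # H(Y)

    # H(Y|X) = sum over x of P(X=x) * H(Y | X=x)
    h_y_given_x = sum(sum(row) * entropy([p / sum(row) for p in row])
                      for row in joint)

    print(h_y - h_y_given_x)  # IG(Y|X): prints ~0.18 here

Whatever probabilities you put in the table, the printed difference never goes below zero.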

EDIT

Let me clarify information gain in the context of decision trees (which is actually what I had in mind in the first place, coming from a machine learning background).

Assume a classification problem where we are given a set of instances and labels (discrete classes).

The idea of choosing which attribute to split on at each node of the tree is to select the feature that splits the instances into the purest possible groups with respect to the class attribute (i.e., lowest entropy).

This is in turn equivalent to picking the feature with the highest information gain, since

InfoGain = entropyBeforeSplit - entropyAfterSplit

where the entropy after the split is the sum of the entropies of each branch, weighted by the fraction of instances that go down that branch.
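As a sketch in code (the helper names entropy and info_gain are my own, not from any particular library; lg above is log base 2), a node can be represented by its per-class instance counts and a split by a list of such count vectors:

    import math

    def entropy(counts):
        # Entropy (in bits) of a node, from its per-class instance counts.
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def info_gain(parent_counts, branch_counts):
        # IG = entropy before the split minus the entropies of the branches,
        # each weighted by the fraction of instances down that branch.
        n = sum(parent_counts)
        after = sum(sum(b) / n * entropy(b) for b in branch_counts)
        return entropy(parent_counts) - after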

Now there exists no possible split of the class values that will produce worse purity (higher entropy) than before the split.

Take this simple example of a binary classification problem. At a certain node we have 4 positive and 5 negative instances (9 in total). Therefore the entropy (before the split) is:

H([4,5]) = -4/9*lg(4/9) -5/9*lg(5/9) = 0.99107606

Now let's consider some cases of splits. The best-case scenario is that the current attribute splits the instances perfectly (i.e., one branch is all positive, the other all negative):

    [4+,5-]
     /   \        H([4,0],[0,5]) =  4/9*( -4/4*lg(4/4) ) + 5/9*( -5/5*lg(5/5) )
    /     \                      =  0           // zero entropy, perfect split
[4+,0-]  [0+,5-]

Then

IG = H([4,5]) - H([4,0],[0,5]) = H([4,5])       // highest possible in this case

Imagine that a second attribute is the worst case possible: one of the branches created gets no instances at all, and all instances go down the other (this could happen if, for example, the attribute is constant across instances, and thus useless):

    [4+,5-]
     /   \        H([4,5],[0,0]) =  9/9 * H([4,5]) + 0
    /     \                      =  H([4,5])    // the entropy as before split
[4+,5-]  [0+,0-]

IG = H([4,5]) - H([4,5],[0,0]) = 0              // lowest possible in this case

Now somewhere in between these two cases, you will see any number of cases like:

    [4+,5-]
     /   \        H([3,2],[1,3]) =  5/9 * ( -3/5*lg(3/5) -2/5*lg(2/5) )
    /     \                       + 4/9 * ( -1/4*lg(1/4) -3/4*lg(3/4) )
[3+,2-]  [1+,3-]

IG = H([4,5]) - H([3,2],[1,3]) = [...] = 0.09109
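For what it's worth, feeding these three splits into the info_gain sketch from earlier in this answer reproduces the numbers (assuming those helper definitions are in scope):

    print(info_gain([4, 5], [[4, 0], [0, 5]]))  # perfect split -> 0.99108
    print(info_gain([4, 5], [[4, 5], [0, 0]]))  # useless split -> 0.0
    print(info_gain([4, 5], [[3, 2], [1, 3]]))  # in between    -> 0.09109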

So no matter how you split those 9 instances, you never get a negative gain in information. I realize this is no mathematical proof (go to MathOverflow for that!); I just thought an actual example might help.

(Note: All calculations according to Google)
