What is "entropy and information gain"?

Question

I am reading this book (NLTK) and it is confusing. Entropy is defined as:

Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple example (visual)?

Answer

I assume entropy was mentioned in the context of building decision trees.

To illustrate, imagine the task of learning to classify first-names into male/female groups. That is, given a list of names each labeled with either m or f, we want to learn a model that fits the data and can be used to predict the gender of a new, unseen first-name.

name       gender
-----------------        Now we want to predict 
Ashley        f              the gender of "Amro" (my name)
Brian         m
Caroline      f
David         m

The first step is deciding what features of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. After feature extraction, our data looks like:

# name    ends-vowel  num-vowels   length   gender
# ------------------------------------------------
Ashley        1         3           6        f
Brian         0         2           5        m
Caroline      1         4           8        f
David         0         2           5        m
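
As a rough sketch of that feature-extraction step in Python (the helper name extract_features and the choice to count 'y' as a vowel, which makes the numbers match the table above, are my own assumptions):

# 'y' is counted as a vowel so the output matches the table above
VOWELS = set("aeiouy")

def extract_features(name):
    name = name.lower()
    return {
        "ends-vowel": int(name[-1] in VOWELS),   # 1 if the last letter is a vowel
        "num-vowels": sum(ch in VOWELS for ch in name),
        "length": len(name),
    }

data = [("Ashley", "f"), ("Brian", "m"), ("Caroline", "f"), ("David", "m")]
for name, gender in data:
    print(name, extract_features(name), gender)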

The goal is to build a decision tree. An example of a tree would be:

length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7
|   length=5: male

Basically, each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node, which contains the class prediction (m or f).

So if we run the name Amro down this tree, we start by testing "is the length<7?" and the answer is yes, so we go down that branch. Following the branch, the next test "is the number of vowels<3?" again evaluates to true. This leads to a leaf node labeled m, and thus the prediction is male (which I happen to be, so the tree predicted the outcome correctly).
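
For illustration, that particular example tree can be hand-coded as a chain of if/else tests (a sketch of my own, not a general decision-tree implementation):

def predict(features):
    # length < 7 branch
    if features["length"] < 7:
        if features["num-vowels"] < 3:
            return "m"
        # num-vowels >= 3: decide on ends-vowel
        return "f" if features["ends-vowel"] == 1 else "m"
    # length >= 7 branch (the example tree shows only a male leaf here)
    return "m"

# "Amro": length 4, two vowels (a, o), ends with a vowel
print(predict({"length": 4, "num-vowels": 2, "ends-vowel": 1}))  # -> m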

The decision tree is built in a top-down fashion, but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e., nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).

This measure of purity is called the information. It represents the expected amount of information that would be needed to specify whether a new instance (a first-name) should be classified as male or female, given the examples that reached the node. We calculate it based on the number of male and female classes at the node.

Entropy, on the other hand, is a measure of impurity (the opposite). It is defined for a binary class with values a/b as:

Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))

The shape of this binary entropy function is sketched below (the random variable can take one of two values). It reaches its maximum when the probability is p=1/2, meaning that p(X=a)=0.5 or, equivalently, p(X=b)=0.5: a 50%/50% chance of being either a or b (uncertainty is at a maximum). The entropy function is at its minimum of zero when the probability is p=1 or p=0 with complete certainty (p(X=a)=1 or p(X=a)=0 respectively; the latter implies p(X=b)=1).
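
Here is a quick numeric sketch of that shape (my own illustration; base-2 logarithms are assumed, so entropy is measured in bits):

import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  entropy={binary_entropy(p):.4f}")
# p=0.00 -> 0.0000, p=0.25 -> 0.8113, p=0.50 -> 1.0000 (maximum),
# p=0.75 -> 0.8113, p=1.00 -> 0.0000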

Of course, the definition of entropy can be generalized to a discrete random variable X with N outcomes (not just two):

H(X) = - sum over i=1..N of p(x_i)*log(p(x_i))

(the log in the formula is usually taken as the logarithm to the base 2)
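
A short Python sketch of the generalized formula, computing entropy directly from class counts (the helper name entropy and the counts-based interface are my own choices):

import math

def entropy(counts):
    # counts: number of instances of each of the N outcomes/classes
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # treat 0*log(0) as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([9, 5]))     # two classes, e.g. [9m, 5f]   -> ~0.9403
print(entropy([5, 5, 5]))  # three equally likely classes -> log2(3) ~ 1.585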

Back to our task of name classification, let's look at an example. Imagine that at some point during the process of constructing the tree, we were considering the following split:

     ends-vowel
      [9m,5f]          <--- the [..,..] notation represents the class
    /          \            distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]

As you can see, before the split we had 9 males and 5 females, i.e. P(m)=9/14 and P(f)=5/14. According to the definition of entropy:

Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403

Next we compare it with the entropy computed after considering the split by looking at two child branches. In the left branch of ends-vowel=1, we have:

Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852

And in the right branch of ends-vowel=0, we have:

Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917

We combine the left/right entropies using the number of instances down each branch as a weighting factor (7 instances went left, and 7 instances went right), and get the final entropy after the split:

Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885

Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature:

Information_Gain = Entropy_before - Entropy_after = 0.1518
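
For verification, the worked numbers above can be reproduced with a few lines (reusing the same counts-based entropy idea; the variable names are mine):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

entropy_before = entropy([9, 5])   # parent node: [9m, 5f]
entropy_left   = entropy([3, 4])   # ends-vowel=1 branch: [3m, 4f]
entropy_right  = entropy([6, 1])   # ends-vowel=0 branch: [6m, 1f]
entropy_after  = (7 / 14) * entropy_left + (7 / 14) * entropy_right
info_gain      = entropy_before - entropy_after
print(round(entropy_before, 4), round(entropy_after, 4), round(info_gain, 4))
# 0.9403 0.7885 0.1518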

You can interpret the above calculation as follows: by splitting on the ends-vowel feature, we were able to reduce the uncertainty in the sub-tree prediction outcome by a small amount, 0.1518 (measured in bits as units of information).

At each node of the tree, this calculation is performed for every feature, and the feature with the largest information gain is chosen for the split in a greedy manner (thus favoring features that produce pure splits with low uncertainty/entropy). This process is applied recursively from the root-node down, and stops when a leaf node contains instances all having the same class (no need to split it further).
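
As a rough sketch of that greedy selection step (simplified code of my own, assuming categorical feature values stored in dicts; a real learner such as ID3/C4.5 also handles numeric thresholds, recursion, and stopping criteria):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # entropy before the split minus the weighted entropy of each child node
    before = entropy(labels)
    after = 0.0
    for value, count in Counter(row[feature] for row in rows).items():
        child = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        after += (count / len(rows)) * entropy(child)
    return before - after

def best_feature(rows, labels, features):
    # pick the feature with the largest information gain (greedy choice)
    return max(features, key=lambda f: information_gain(rows, labels, f))

Applied recursively to the instances that reach each child node, this yields the top-down construction described above.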

Note that I skipped over some details which are beyond the scope of this post, including how to handle numeric features, missing values, overfitting, pruning trees, etc.
