Weighing Samples in a Decision Tree

Problem Description

I've constructed a decision tree that weights every sample equally. Now I want to construct a decision tree that gives different weights to different samples. Is the only change I need to make in finding the expected entropy before calculating the information gain? I'm a little confused about how to proceed; please explain.

For example: consider a node containing p positive examples and n negative examples, so the node's entropy is -p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n)). Now suppose a split is found that divides the parent node into two child nodes, where child 1 contains p' positives and n' negatives (so child 2 contains p - p' and n - n'). For child 1 we calculate the entropy in the same way as for the parent and take the probability of reaching it, i.e. (p' + n')/(p + n). The expected reduction in entropy is then entropy(parent) - (prob. of reaching child1 * entropy(child1) + prob. of reaching child2 * entropy(child2)), and the split with the maximum information gain is chosen.
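
For reference, here is a minimal sketch of this unit-weight computation; the function names entropy and info_gain are illustrative, and base-2 logarithms are assumed:

import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative examples (unit weights)."""
    total = p + n
    if total == 0:
        return 0.0
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def info_gain(p, n, p1, n1):
    """Expected entropy reduction when (p, n) is split into child 1 = (p1, n1)
    and child 2 = (p - p1, n - n1); each child's entropy is weighted by the
    probability of reaching it, e.g. (p1 + n1) / (p + n) for child 1."""
    total = p + n
    child1 = p1 + n1
    child2 = total - child1
    expected = (child1 / total) * entropy(p1, n1) + (child2 / total) * entropy(p - p1, n - n1)
    return entropy(p, n) - expected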

How do I carry out this same procedure when a weight is available for each sample? What changes need to be made? What changes are needed specifically for AdaBoost (using stumps only)?

Recommended Answer

(I guess this is the same idea as in some comments, e.g., @Alleo)

Suppose you have p positive examples and n negative examples. Let's denote the weights of examples to be:

a1, a2, ..., ap  ----------  weights of the p positive examples
b1, b2, ..., bn  ----------  weights of the n negative examples

Suppose

a1 + a2 + ... + ap = A 
b1 + b2 + ... + bn = B

As you pointed out, if the examples have unit weights, the entropy would be:

    p          p          n         n
- _____ log (____ )  - ______log(______ )
  p + n      p + n      p + n     p + n

Now you only need to replace p with A and n with B to obtain the new, instance-weighted entropy:

    A          A          B         B
- _____ log (_____)  - ______log(______ )
  A + B      A + B      A + B     A + B
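
A minimal sketch of that substitution, assuming the example weights are given as two lists and using base-2 logarithms (weighted_entropy is an illustrative name, not from the original answer):

import math

def weighted_entropy(pos_weights, neg_weights):
    """Entropy of a node, using A = sum of positive weights and
    B = sum of negative weights in place of the counts p and n."""
    A, B = sum(pos_weights), sum(neg_weights)
    total = A + B
    if total == 0:
        return 0.0
    result = 0.0
    for mass in (A, B):
        if mass > 0:
            frac = mass / total
            result -= frac * math.log2(frac)
    return result

# Sanity check: with unit weights this reduces to the ordinary entropy,
# e.g. weighted_entropy([1, 1, 1], [1, 1]) equals the entropy of 3 positives and 2 negatives.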

Note: nothing fancy here. All we did was figure out the weighted importance of the groups of positive and negative examples. When the examples are equally weighted, the importance of the positive examples is proportional to the ratio of the number of positive examples to the number of all examples. When the examples are weighted unequally, we simply perform a weighted average to get the importance of the positive examples.

Then you follow the same logic to choose the attribute with the largest information gain, by comparing the entropy before and after splitting on that attribute.
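
To make the selection step concrete, here is a minimal sketch of the weighted information gain of one binary split (a stump); stump_weighted_info_gain, its arguments, and the use of base-2 logarithms are illustrative assumptions, not part of the original answer:

import math

def mass_entropy(A, B):
    """Entropy of a node whose positive examples have total weight A and
    negative examples have total weight B (unit weights give back plain counts)."""
    total = A + B
    if total == 0:
        return 0.0
    result = 0.0
    for mass in (A, B):
        if mass > 0:
            frac = mass / total
            result -= frac * math.log2(frac)
    return result

def stump_weighted_info_gain(weights, labels, goes_left):
    """Weighted information gain of a single binary split (a stump).

    weights   -- per-example weights, e.g. the current AdaBoost distribution
    labels    -- True for positive examples, False for negative
    goes_left -- True if the example is routed to child 1, False for child 2

    The 'probability of reaching' a child is the weight routed to it divided
    by the total weight at the parent, i.e. (A' + B') / (A + B)."""
    A  = sum(w for w, y in zip(weights, labels) if y)
    B  = sum(w for w, y in zip(weights, labels) if not y)
    A1 = sum(w for w, y, g in zip(weights, labels, goes_left) if y and g)
    B1 = sum(w for w, y, g in zip(weights, labels, goes_left) if not y and g)

    parent_mass = A + B
    child1_mass = A1 + B1
    child2_mass = parent_mass - child1_mass

    expected = (child1_mass / parent_mass) * mass_entropy(A1, B1) \
             + (child2_mass / parent_mass) * mass_entropy(A - A1, B - B1)
    return mass_entropy(A, B) - expected

For AdaBoost with stumps, one would evaluate this gain for every candidate feature (and threshold) using the current example weights, keep the stump with the largest gain, and then let AdaBoost re-weight the examples as usual.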
