ID3 and C4.5: How Does "Gain Ratio" Normalize "Gain"?


Question

The ID3 algorithm uses the "Information Gain" measure.

C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo. SplitInfo is high for a split where records are divided evenly between the different outcomes and low otherwise.
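For reference, the two quantities are usually written as below. The notation is assumed here for illustration and does not come from the question: $S$ is the set of records at the node, $A$ is the candidate attribute, and $S_1, \dots, S_k$ are the subsets produced by splitting $S$ on the $k$ outcomes of $A$.

```latex
\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|},
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```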

My question is:

How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.

It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.

On the other hand, it may be that there is a low number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.

What am I missing?

Recommended answer

"SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split."

But it does take the number of outcomes into account (even though it also depends on the distribution, as you noted). Your comparison is between two situations with the same ("low") number of outcomes, so it can't possibly illustrate how SplitInfo changes as the number of outcomes changes.

Consider the following 3 situations, all with an even distribution for simplicity of comparison:

  • 10 possible outcomes with even distribution

    SplitInfo = -10*(1/10*log2(1/10)) = 3.32

  • 100 possible outcomes with even distribution

    SplitInfo = -100*(1/100*log2(1/100)) = 6.64

  • 1000 possible outcomes with even distribution

    SplitInfo = -1000*(1/1000*log2(1/1000)) = 9.97
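The three values above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `split_info_uniform` is made up for illustration and is not part of any C4.5 implementation.

```python
import math

def split_info_uniform(k: int) -> float:
    """SplitInfo of a split into k equally sized branches."""
    p = 1.0 / k  # each branch receives the same fraction of the records
    return -sum(p * math.log2(p) for _ in range(k))

for k in (10, 100, 1000):
    print(k, round(split_info_uniform(k), 2))
# prints:
# 10 3.32
# 100 6.64
# 1000 9.97
```

For an even split the sum collapses to log2(k), which is why SplitInfo grows steadily with the number of outcomes.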

So if you had to choose between these 3 possible splitting scenarios using only Information Gain, as in ID3, the last one would be chosen. However, with SplitInfo in the denominator of GainRatio, it should be clear that as the number of outcomes goes up, SplitInfo also goes up, and GainRatio goes down.
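To make the effect concrete, here is a small sketch on a made-up toy dataset (8 records, hypothetical attributes `record_id` and `binary`, none of which come from the original post). It compares a unique-per-record attribute with a two-valued one under both criteria. The helper functions are illustrative, not taken from any C4.5 implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for splitting labels by values."""
    branches = defaultdict(list)
    for v, y in zip(values, labels):
        branches[v].append(y)
    n = len(labels)
    # Weighted entropy of the class labels remaining after the split.
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    gain = entropy(labels) - remainder
    # SplitInfo is the entropy of the branch sizes themselves.
    # Toy code: assumes the attribute has at least two distinct values,
    # otherwise split_info is 0 and the division fails.
    split_info = entropy(values)
    return gain, gain / split_info

labels = [1, 1, 1, 1, 0, 0, 0, 0]
record_id = list(range(8))                          # 8 outcomes, one record each
binary = ["a", "a", "a", "b", "b", "b", "b", "b"]   # 2 outcomes, imperfect split

for name, attr in (("record_id", record_id), ("binary", binary)):
    gain, ratio = gain_and_ratio(attr, labels)
    print(f"{name}: gain={gain:.3f}, gain_ratio={ratio:.3f}")
```

On this data `record_id` has the larger Information Gain, so it would win under ID3, while `binary` has the larger Gain Ratio, because dividing by SplitInfo = 3 penalizes the eight-way split.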

All of that was explained under the assumption of evenly distributed splits. However, even with a non-uniform distribution, the above still holds: SplitInfo tends to get higher as the number of possible outcomes gets higher. Yes, if we hold the number of possible outcomes constant and vary the outcome distribution, then SplitInfo will vary somewhat... but so will Information Gain.
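The non-uniform case can be checked numerically as well. The sketch below assumes one particular family of skewed splits, chosen purely for illustration: one branch receives 90% of the records and the remaining k-1 branches share the other 10% evenly. The function names are hypothetical.

```python
import math

def split_info(proportions):
    """SplitInfo for a split whose branches receive the given fractions of records."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def skewed(k, head=0.9):
    """Hypothetical skewed split: one branch gets `head`, the other k-1 share the rest evenly."""
    tail = (1.0 - head) / (k - 1)
    return [head] + [tail] * (k - 1)

for k in (2, 10, 100, 1000):
    print(k, round(split_info(skewed(k)), 3))
```

For this family SplitInfo still increases with k, although the values are much smaller than in the evenly distributed cases, which reflects the dependence on distribution mentioned above.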
