ID3 and C4.5: How Does "Gain Ratio" Normalize "Gain"?
Question
The ID3 algorithm uses the "Information Gain" measure.
C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split whose records are divided evenly among the different outcomes and low otherwise.
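For reference, the two quantities are usually written as below. This is a sketch of the standard textbook form, not something stated in the question; the symbols S (the set of records at the node), A (the candidate attribute), k (the number of outcomes of A) and S_i (the subset of records with the i-th outcome) are notation assumed here.

```latex
\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|},
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```

When the k subsets are all the same size, every fraction is 1/k and SplitInfo reduces to log2(k), so it depends directly on the number of outcomes.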
My question is:
How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.
It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.
On the other hand, it may be that there is a low number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.
What am I missing?
Recommended Answer
"SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split."
But it does take the number of outcomes into account (even though it also depends on the distribution, as you noted). Your comparison is between two situations with the same ("low") number of outcomes, so it cannot illustrate how SplitInfo changes as the number of outcomes changes.
Consider the following three situations, all with an even distribution for simplicity of comparison:
- 10 possible outcomes with even distribution
  SplitInfo = -10*(1/10*log2(1/10)) = 3.32
- 100 possible outcomes with even distribution
  SplitInfo = -100*(1/100*log2(1/100)) = 6.64
- 1000 possible outcomes with even distribution
  SplitInfo = -1000*(1/1000*log2(1/1000)) = 9.97
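The three SplitInfo values above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name split_info and its input format (a list of record counts, one per outcome) are choices made here for illustration, not part of ID3 or C4.5 themselves.

```python
from math import log2


def split_info(counts):
    """SplitInfo of a split, given the number of records in each outcome."""
    total = sum(counts)
    # Empty outcomes contribute nothing, so skip them to avoid log2(0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Even distributions: one record per outcome is enough to model them.
for k in (10, 100, 1000):
    print(k, round(split_info([1] * k), 2))
# prints:
# 10 3.32
# 100 6.64
# 1000 9.97
```

For an even split over k outcomes the result is simply log2(k), which is why each tenfold increase in outcomes adds about 3.32.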
So if you had to choose among these three splitting scenarios using only Information Gain, as in ID3, the last one would be chosen. However, with SplitInfo in the denominator of GainRatio, it should be clear that as the number of outcomes goes up, SplitInfo also goes up and, for a comparable Information Gain, GainRatio goes down.
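To make the effect concrete, here is a small sketch on a made-up dataset of 8 records. The data, the attribute names (record_id, flag) and the helper functions are all hypothetical and only meant to illustrate the point: an ID-like attribute with one outcome per record gets the highest Information Gain, but its large SplitInfo pulls its Gain Ratio below that of a two-outcome attribute.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of the value distribution in a list."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information Gain divided by SplitInfo (the entropy of the attribute itself).

    Assumes the attribute takes at least two distinct values; otherwise
    SplitInfo is 0 and the ratio is undefined.
    """
    return information_gain(values, labels) / entropy(values)


# Hypothetical toy data: 8 records, binary class label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
record_id = list(range(8))                        # 8 outcomes, one per record
flag = ["a", "a", "a", "b", "b", "b", "b", "b"]   # 2 outcomes, imperfect split

for name, attr in (("record_id", record_id), ("flag", flag)):
    print(name, information_gain(attr, labels), gain_ratio(attr, labels))
```

On this data Information Gain ranks record_id first, while Gain Ratio ranks flag first, because record_id's SplitInfo is log2(8) = 3 and flag's is below 1.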
All of that was explained under the assumption of evenly distributed splits. However, the same reasoning carries over to non-uniform distributions: SplitInfo tends to get higher as the number of possible outcomes gets higher. Yes, if we hold the number of possible outcomes constant and vary the outcome distribution, then SplitInfo will vary somewhat... but so will Information Gain.
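As a rough illustration of the non-uniform case, the sketch below keeps one particular skewed shape fixed (one outcome always holds half of the records, the rest share the other half evenly) and increases the number of outcomes. Both the shape and the helper names are assumptions chosen for this example; other skewed shapes would give different numbers.

```python
from math import log2


def split_info(counts):
    """SplitInfo of a split, given the number of records in each outcome."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def skewed_counts(k):
    """k outcomes: the first holds half of the records, the other k - 1 hold one each."""
    return [k - 1] + [1] * (k - 1)


for k in (2, 5, 17):
    print(k, split_info(skewed_counts(k)))
```

For this family SplitInfo works out to 1 + 0.5 * log2(k - 1), so it still grows with the number of outcomes even though the split is far from even.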