How to calculate the threshold value for numeric attributes in Quinlan's C4.5 algorithm?

Question

I am trying to find out how the C4.5 algorithm determines the threshold value for numeric attributes. I have researched this but still cannot understand it; in most places I have found the following information:

The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1,v2, …,vm}. Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
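The enumeration described in this passage can be sketched in a few lines of Python. This is only an illustration, not C4.5's actual code: the function name `candidate_splits` is hypothetical, and the input is the set of humidity values for the sunny cases given in the question.

```python
def candidate_splits(values):
    """Return the m-1 ways to cut the sorted distinct values into two groups."""
    distinct = sorted(set(values))  # {v1, v2, ..., vm}
    # Cut after position i: left = {v1..vi}, right = {vi+1..vm}
    return [(distinct[:i], distinct[i:]) for i in range(1, len(distinct))]


# Humidity values of the sunny cases (70 occurs twice in the data)
splits = candidate_splits([70, 70, 85, 90, 95])
for left, right in splits:
    print(left, "|", right)
```

With four distinct values there are three candidate splits, matching the m-1 count in the quoted text.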

It is usual to choose the midpoint of each interval: (vi +vi+1)/2 as the representative threshold. C4.5 chooses as the threshold a smaller value vi for every interval {vi, vi+1}, rather than the midpoint itself.
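To make the two conventions concrete, here is a minimal sketch that returns, for each gap between adjacent distinct values, both the midpoint and a value "snapped" down to an observed data value. The function name `candidate_thresholds` and the `all_values` parameter are hypothetical. The snapping rule (largest observed value not exceeding the midpoint) is one reading of the quoted sentence; letting it look at values from the whole training set, rather than only the current subset, is an assumption made here for illustration.

```python
def candidate_thresholds(subset_values, all_values=None):
    """Return (midpoint, snapped) pairs, one per gap between adjacent distinct values.

    snapped is the largest observed value that does not exceed the midpoint.
    all_values (hypothetical parameter) lets the snapping consider values
    observed outside the current subset.
    """
    distinct = sorted(set(subset_values))
    pool = set(subset_values) | set(all_values or [])
    pairs = []
    for lo, hi in zip(distinct, distinct[1:]):
        mid = (lo + hi) / 2
        snapped = max(v for v in pool if v <= mid)  # lo is always a candidate
        pairs.append((mid, snapped))
    return pairs


sunny_humidity = [70, 70, 85, 90, 95]
print(candidate_thresholds(sunny_humidity))

# Assumption for illustration only: a humidity of 75 is observed in some
# other branch of the training set (the full table is not shown here).
print(candidate_thresholds(sunny_humidity, all_values=[75]))
```

Using only the sunny values, the first gap yields 77.5 (midpoint) or 70 (snapped). If a value of 75 does occur elsewhere in the training data, the snapped threshold for that gap becomes 75, which would be consistent with the number shown in the generated tree.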

I am studying a Play/Don't Play example (value table) and do not understand how the number 75 (generated tree) is obtained for the humidity attribute when the outlook is sunny, because the humidity values for the sunny cases are {70, 85, 90, 95}.

Does anyone know?

Recommended answer

As your generated tree image implies, attributes are considered in order. Your 75 example belongs to the outlook = sunny branch. If you filter your data by outlook = sunny, you get the following table.

outlook temperature humidity    windy   play
sunny   69           70         FALSE   yes
sunny   75           70         TRUE    yes
sunny   85           85         FALSE   no
sunny   80           90         TRUE    no
sunny   72           95         FALSE   no

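As you can see, the threshold for humidity is "< 75" under this condition: both cases with humidity 70 are "yes", and the cases with humidity 85, 90, and 95 are all "no".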

J48 (Weka's implementation of C4.5) is a successor to the ID3 algorithm. It uses information gain and entropy to decide the best split. According to Wikipedia:

The attribute with the smallest entropy 
is used to split the set on this iteration. 
The higher the entropy, 
the higher the potential to improve the classification here.
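As a rough check of this idea on the sunny subset, the sketch below computes the entropy of the class labels and the information gain of each candidate humidity split, using midpoints as representative thresholds. The function names `entropy` and `best_split` are hypothetical, and plain information gain is used as described above; this is not a reproduction of C4.5's or J48's exact selection criterion.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, gain) for the midpoint split with the highest information gain."""
    pairs = list(zip(values, labels))
    base = entropy(labels)
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # Weighted entropy remaining after the split
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Humidity and play columns of the outlook = sunny table
humidity = [70, 70, 85, 90, 95]
play = ["yes", "yes", "no", "no", "no"]

threshold, gain = best_split(humidity, play)
print(threshold, round(gain, 3))
```

On this data the best cut falls between 70 and 85, where both sides become pure, so the gain equals the full entropy of the subset (about 0.971 bits). Any threshold inside that gap, including 75, produces the same partition.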
