How does the C4.5 algorithm handle continuous data?

Question

I am implementing the C4.5 algorithm in .NET, but I don't have a clear idea of how it deals with continuous (numeric) data. Could someone give me a more detailed explanation?

Recommended answer

For continuous data, C4.5 uses a threshold value: everything less than the threshold goes in the left node, and everything greater than the threshold goes in the right node. The question is how to derive that threshold from the data you're given. The trick is to sort your data by the continuous variable in ascending order, then iterate over the sorted values, picking a candidate threshold between each pair of adjacent values. For example, if your data for attribute x is:

0.5, 1.2, 3.4, 5.4, 6.0
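The sort-and-scan step can be sketched in a few lines. This is a minimal Python illustration rather than .NET code; the function name `candidate_thresholds` is hypothetical, and collapsing duplicate values before taking midpoints is an assumption the answer does not spell out.

```python
def candidate_thresholds(values):
    """Return the midpoints between adjacent distinct values, in ascending order.

    Assumption: duplicate values are collapsed first, since no threshold
    can separate two identical values.
    """
    ordered = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]


x = [0.5, 1.2, 3.4, 5.4, 6.0]
# Five distinct values give four candidate thresholds,
# the first of which lies between 0.5 and 1.2.
print(candidate_thresholds(x))
```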

First pick a threshold between 0.5 and 1.2. Here we can just use their average, 0.85. Now compute the impurity reduction for that split:

Gain(x < 0.85) = H(S) - (l/N) * H(x < 0.85) - (r/N) * H(x > 0.85)

Here H is the impurity measure (entropy, in C4.5): H(S) is the impurity of the node being split, and H(x < 0.85) and H(x > 0.85) are the impurities of the resulting left and right nodes. l is the number of samples in the left node, r is the number of samples in the right node, and N is the total number of samples in the node being split. In the example above, splitting at 0.85 gives l = 1, r = 4, and N = 5.
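To make the formula concrete, here is a minimal Python sketch that evaluates it for one threshold, using entropy as the impurity measure. The class labels in `y` are invented purely for illustration (the answer gives only the attribute values), and the names `entropy` and `split_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def split_gain(values, labels, threshold):
    """Impurity reduction from splitting at `threshold`.

    Samples below the threshold go left, the rest go right.
    """
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0  # a one-sided "split" separates nothing
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


x = [0.5, 1.2, 3.4, 5.4, 6.0]
y = ["a", "a", "b", "b", "b"]  # hypothetical labels, not from the answer

# l = 1, r = 4, N = 5 for this threshold
print(split_gain(x, y, 0.85))
```

With these made-up labels, the split at 0.85 leaves one "a" on the left and a mixed node on the right, so it removes only part of the parent node's entropy.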

Remember the computed impurity reduction, then compute it for the next split, between the second and third values (1.2 and 3.4, i.e. x > 2.3). Repeat that for every candidate split (n - 1 splits for n sorted values). Then pick the split with the largest impurity reduction, which is the one that leaves the weighted impurity of the two child nodes lowest. The chosen split should make the resulting nodes purer than the unsplit node; if no split improves purity, don't split the node. You can also enforce a minimum node size so you don't end up with a left or right node containing only one sample.
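The whole search can be sketched as follows. This is a hedged Python illustration of the procedure as described in this answer, using plain entropy-based impurity reduction; C4.5 as published adds refinements such as the gain ratio that are not covered here. The function name `best_threshold`, the `min_leaf` parameter, and the labels in `y` are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def best_threshold(values, labels, min_leaf=1):
    """Scan every midpoint between adjacent sorted values.

    Returns (threshold, gain) for the split with the largest impurity
    reduction, or (None, 0.0) if no split improves purity or satisfies
    the minimum node size.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    parent = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):  # i = number of samples in the left node
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated
        if i < min_leaf or n - i < min_leaf:
            continue  # would create a node that is too small
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - i / n * entropy(left) - (n - i) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


x = [0.5, 1.2, 3.4, 5.4, 6.0]
y = ["a", "a", "b", "b", "b"]  # hypothetical labels, not from the answer

threshold, gain = best_threshold(x, y, min_leaf=1)
print(threshold, gain)
```

Recomputing the entropy of both sides at every candidate keeps the sketch short but costs O(n^2) per attribute; a real implementation would more likely update running class counts while sweeping through the sorted values.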
