C4.5 Algorithm, continuous data


Question

I am implementing the C4.5 algorithm in .NET, but I don't have a clear idea of how it deals with continuous (numeric) data. Could someone give me a more detailed explanation?

Recommended answer

For continuous data, C4.5 uses a threshold value: every sample whose value is less than the threshold goes to the left node, and every sample whose value is greater than the threshold goes to the right node. The question is how to derive that threshold from the data you are given. The trick is to sort your data by the continuous variable in ascending order, then iterate over it, picking a candidate threshold between each pair of adjacent values. For example, if your data for attribute x is:

0.5, 1.2, 3.4, 5.4, 6.0
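
The sort-then-scan step can be sketched in a few lines of Python. This is only an illustration, not C4.5's reference implementation: the function name `candidate_thresholds` is made up for this example, and skipping duplicate values (so that no threshold falls between two equal values) is an assumption the answer does not spell out.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct values.

    Hypothetical helper: the values are sorted in ascending order and one
    candidate threshold is placed halfway between each adjacent pair.
    """
    ordered = sorted(set(values))  # assumption: equal values share a side
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


# The attribute values from the example, deliberately given out of order.
xs = [3.4, 0.5, 6.0, 1.2, 5.4]
thresholds = candidate_thresholds(xs)
print(thresholds)  # four candidates, one per gap between sorted values
```

Five distinct values give four candidate thresholds, the first of which is the 0.85 midpoint between 0.5 and 1.2.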

You first pick a threshold between 0.5 and 1.2. In this case we can just use the average: 0.85. Now compute the impurity reduction (the information gain) for that split:

Gain(x < 0.85) = H(s) - (l/N) * H(x < 0.85) - (r/N) * H(x > 0.85)

Here H is the impurity measure (entropy, in C4.5), H(s) is the impurity of the node before the split, l is the number of samples in the left node, r is the number of samples in the right node, and N is the total number of samples in the node being split. In the example above, with the split at 0.85, l = 1, r = 4, and N = 5.
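
To make the formula concrete, here is a minimal Python sketch that evaluates it for the 0.85 threshold. It takes entropy as the impurity measure. The class labels in `ys` are invented for illustration, since the answer lists only attribute values, and the names `entropy` and `split_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(xs, ys, threshold):
    """H(s) - (l/N) * H(left) - (r/N) * H(right) for a single threshold."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    # A value exactly equal to the threshold is sent right here; midpoint
    # thresholds never coincide with a data value, so this is only a convention.
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    n = len(ys)
    return (entropy(ys)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


xs = [0.5, 1.2, 3.4, 5.4, 6.0]
ys = ["a", "a", "b", "b", "b"]  # hypothetical labels, not from the answer

gain = split_gain(xs, ys, 0.85)  # l = 1, r = 4, N = 5
print(round(gain, 3))
```

With these made-up labels, the split at 0.85 leaves one sample on the left and a mixed group of four on the right, so it removes only part of the parent node's entropy.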

Remember the computed impurity difference, and now compute it for the split between the second and third values (1.2 and 3.4, i.e. x > 2.3). Repeat that for every candidate split (n - 1 splits for n sorted values). Then pick the split that reduces the impurity the most, that is, the one whose child nodes have the lowest weighted impurity. The chosen split should leave the child nodes purer than the unsplit node; if no split increases the purity of the resulting nodes, don't split. You can also enforce a minimum node size so that you don't end up with left or right nodes containing only one sample.
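
Putting the whole procedure together, a rough Python sketch might look like the following. It follows the answer and ranks splits by plain information gain; C4.5 itself uses the gain ratio, which is left out here. The function `best_threshold`, its `min_leaf` parameter, and the labels in `ys` are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(xs, ys, min_leaf=1):
    """Scan the n - 1 gaps between sorted values and keep the best split.

    Returns (threshold, gain), or (None, 0.0) when no split improves purity
    or every split would violate the minimum node size.
    """
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])  # sort by attribute value
    labels = [y for _, y in pairs]
    n = len(pairs)
    parent = entropy(labels)
    best_t, best_gain = None, 0.0
    for i in range(1, n):  # i samples go left, n - i go right
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        if i < min_leaf or n - i < min_leaf:
            continue  # enforce the minimum node size
        gain = (parent
                - i / n * entropy(labels[:i])
                - (n - i) / n * entropy(labels[i:]))
        if gain > best_gain:  # only accept splits that increase purity
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


xs = [0.5, 1.2, 3.4, 5.4, 6.0]
ys = ["a", "a", "b", "b", "b"]  # hypothetical labels, not from the answer

threshold, gain = best_threshold(xs, ys)
print(threshold, gain)
```

With these invented labels the scan settles on the midpoint between 1.2 and 3.4, because that split separates the two classes completely. Raising `min_leaf` to 3 rules out every split of five samples, in which case the function returns `(None, 0.0)` and the node would stay a leaf.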
