Method of finding the threshold in a decision tree for continuous data


Question

I am using a decision tree in Weka and I have some continuous data. Weka finds the threshold for me automatically, but for some reason I want to implement the decision tree myself. What approach should I use to find the threshold that discretizes my continuous data?

Recommended answer

ID3 and C4.5 use an entropy heuristic to discretize continuous data. The method finds a binary cut for each variable (feature). You can apply the same method recursively to each resulting interval to split continuous data into more than two intervals.

Suppose that at a certain tree node all instances belong to a set S, and you are working on variable A with a particular boundary (cut) T. The class information entropy of the partition induced by T, denoted E(A, T, S), is given by:

             |S1|                 |S2|
E(A, T, S) = ---- Entropy(S1) +   ---- Entropy(S2)
              |S|                 |S|

where |S1| is the number of instances in the first partition, |S2| is the number of instances in the second partition, and |S| = |S1| + |S2|. Entropy(S1) and Entropy(S2) are the class entropies of the two partitions.
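The formula above can be sketched in Python as follows. This is a minimal illustration, not Weka's implementation: the function names are made up for this example, Entropy is taken to be the base-2 Shannon entropy of the class labels, and S1 is assumed to hold the instances with value <= T (the answer does not say which side the boundary value falls on).

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention: an empty partition contributes nothing
    # p * log2(1/p) summed over classes, with p = count / n
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def partition_entropy(values, labels, t):
    """E(A, T, S): size-weighted entropy of the two partitions induced by cut t.

    Assumption: S1 = instances with value <= t, S2 = instances with value > t.
    """
    s1 = [y for x, y in zip(values, labels) if x <= t]
    s2 = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)


# Length values from the example further down; the class labels are hypothetical.
lengths = [2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0]
classes = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(partition_entropy(lengths, classes, 9.0))  # cut that separates the classes
print(partition_entropy(lengths, classes, 3.5))  # cut that leaves S2 mixed
```

A cut that separates the classes cleanly gives E = 0, and a cut that leaves a partition mixed gives a larger value, which is why the smallest E is preferred.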

For a given feature A, the boundary T_min that minimizes the entropy function over all possible partition boundaries is selected as the binary discretization boundary.

For example, you might have a variable Length whose possible values are:

Length = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}

Then your candidate values of T could be:

T = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}

Here you cut at every possible Length value. You could also cut at the midpoint of each pair of adjacent Length values, e.g.,

T = {2.45, 3.15, 5.75, 9.0, 15.0, 35.0, 50.5}
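As a small sketch of how such midpoint candidates could be generated (the helper name `candidate_cuts` is hypothetical, and dropping duplicate values before pairing is an assumption the answer does not spell out):

```python
def candidate_cuts(values):
    """Midpoints between adjacent distinct values, in ascending order."""
    distinct = sorted(set(values))
    # Pair each value with its successor and take the average of the two.
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


lengths = [2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0]
cuts = candidate_cuts(lengths)
print([round(c, 2) for c in cuts])
```

Up to floating-point rounding, this reproduces the midpoint list above. For n distinct values it yields n - 1 candidates, and neither partition is ever empty.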

At discretization time, iterate through all candidate T values and pick the one that yields the minimum E(A, T, S). That's it.
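Put together, the search might look like the sketch below. It assumes midpoint candidates and the "value <= T goes to S1" convention; `best_threshold` and the class labels are hypothetical and only serve the illustration. It re-scans the data for every candidate, which is simple but quadratic in the number of instances.

```python
import math
from collections import Counter


def _entropy(labels):
    """Base-2 Shannon entropy of a non-empty list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (T_min, E_min) over the midpoints of adjacent distinct values.

    Returns (None, inf) when there are fewer than two distinct values.
    """
    n = len(values)
    distinct = sorted(set(values))
    best_t, best_e = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        s1 = [y for x, y in zip(values, labels) if x <= t]
        s2 = [y for x, y in zip(values, labels) if x > t]
        e = len(s1) / n * _entropy(s1) + len(s2) / n * _entropy(s2)
        if e < best_e:  # strict '<' keeps the first of several tied cuts
            best_t, best_e = t, e
    return best_t, best_e


# Length values from the example above; the class labels are hypothetical.
lengths = [2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0]
classes = ["a", "a", "a", "a", "b", "b", "b", "b"]
t_min, e_min = best_threshold(lengths, classes)
print(t_min, e_min)
```

With these made-up labels the classes separate between 8.0 and 10.0, so the search settles on the midpoint of that gap. To get more than two intervals, the same function could be called again on each side of the chosen cut, with a stopping rule of your choosing.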

See more details in this paper, which also describes other optional methods:

  • ChiMerge discretization method.
  • Learning Vector Quantization (LVQ) based method.
  • Histogram-based method.
