Method of finding the threshold in a decision tree for continuous data

Views: 463 · Published: 2020/5/4 10:27:42 · Tags: machine-learning, weka, decision-tree

Question

I am using a decision tree in Weka and I have some continuous data. Weka automatically finds the threshold for me, but for certain reasons I want to implement the decision tree myself. What approach should I use to find the threshold for discretizing my continuous data?

Recommended answer

ID3 and C4.5 use an entropy heuristic to discretize continuous data. The method finds a binary cut for each variable (feature). You can apply the same method recursively within each resulting interval to split continuous data into more than two intervals.

Suppose that at a certain tree node all instances belong to a set S, and you are working on variable A with a particular boundary (cut) T. The class information entropy of the partition induced by T, denoted E(A, T, S), is given by:

             |S1|                 |S2|
E(A, T, S) = ---- Entropy(S1) +   ---- Entropy(S2)
              |S|                 |S|

where S1 and S2 are the two subsets of S on either side of the cut T; |S1| is the number of instances in the first partition; |S2| is the number of instances in the second partition; and |S| = |S1| + |S2|.
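The formula above can be sketched in a few lines of Python. This is a minimal illustration, not Weka's implementation: the function names `entropy` and `partition_entropy`, the convention that S1 holds instances with value <= T, and the toy data are all assumptions made for the example. Entropy(S) is taken to be the usual Shannon entropy of the class labels in S, measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # sum of p * log2(1/p) over the classes present in `labels`
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def partition_entropy(values, labels, t):
    """E(A, T, S): size-weighted entropy of the two parts induced by cut t.

    Assumption for this sketch: S1 holds instances with value <= t,
    S2 holds the rest.
    """
    s1 = [y for x, y in zip(values, labels) if x <= t]
    s2 = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

# Hypothetical toy data: four instances, two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(partition_entropy(values, labels, 2.5))  # cut that separates the classes
print(partition_entropy(values, labels, 1.5))  # cut that leaves S2 mixed
```

A cut that separates the classes perfectly gives an entropy of zero; a cut that leaves a mixed partition gives a larger value, which is why the smaller E(A, T, S) is preferred.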

For a given feature A, the boundary T_min that minimizes the entropy function over all possible partition boundaries is selected as the binary discretization boundary.

For example, you might have a variable Length whose possible values are:

Length = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}

Then your T could be:

T = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}

Here you cut at every possible Length value. You could also cut at the midpoint of each pair of adjacent Length values, e.g.,

T = {2.45, 3.15, 5.75, 9.0, 15.0, 35.0, 50.5}
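The midpoint candidates can be generated mechanically. The helper name `midpoints` below is hypothetical; the sketch sorts the distinct values and averages each adjacent pair, which should reproduce the list above up to floating-point rounding.

```python
def midpoints(values):
    """Candidate cuts: the midpoint of each pair of adjacent distinct values."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

# The Length values from the example above.
length = [2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0]
cuts = midpoints(length)
print([round(c, 2) for c in cuts])
```

With n distinct values there are n - 1 midpoint candidates, one fewer than when cutting at every value.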

At discretization time, you iterate through all candidate T values and keep the one that yields the minimum E(A, T, S). That's it.
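Putting the pieces together, the search might look like the sketch below. It is an illustration under stated assumptions, not the exact procedure of any particular library: `best_threshold` is a hypothetical name, candidates are midpoints between adjacent distinct values, and the class labels attached to the Length values are invented for the example (the original post gives no labels).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (T_min, E_min): the midpoint cut with the lowest E(A, T, S)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut can separate two equal values
        t = (lo + hi) / 2
        left = [y for _, y in pairs[:i]]   # S1: values below the cut
        right = [y for _, y in pairs[i:]]  # S2: values above the cut
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# Length values from the example; the labels are hypothetical.
length = [2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0]
label = ["small", "small", "small", "large", "large", "large", "large", "large"]
t_min, e_min = best_threshold(length, label)
print(t_min, e_min)
```

This version recomputes both entropies for every candidate, so it is quadratic in the number of instances. Keeping running class counts while sweeping the sorted values would avoid that, at the cost of a little more code.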

See this paper for more details; it also describes other alternative methods:

ChiMerge discretization method.
Learning Vector Quantization (LVQ) based method.
Histogram-based method.
