Decision trees. Choosing thresholds to split objects


Question

If I understand this correctly, we are given a set of objects (each an array of features) and need to split it into 2 subsets. To do that, we compare some feature x_j to a threshold t_m (t_m is the threshold at node m). We use an impurity function H() to find the best way to split the objects. But how do we choose the value of t_m, and which feature should be compared to the threshold? There are infinitely many ways to choose t_m, so we can't just compute H() for each possibility.

Recommended answer

On page 18 of these slides, two methods are introduced for choosing the splitting threshold for a numerical attribute X.

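Method 1:

  • Sort the data according to X into {x_1, ..., x_m}
  • Consider split points of the form x_i + (x_{i+1} - x_i)/2, i.e. the midpoint between each pair of consecutive sorted values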
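Method 1 also addresses the "infinite number of thresholds" concern: any two thresholds that fall between the same pair of consecutive sorted values split the data identically, so at most m - 1 candidates need to be evaluated. A minimal sketch of this, where the function name `candidate_thresholds` is made up for illustration and skipping duplicate values is an added assumption rather than something the slides state:

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values."""
    # Assumption: equal values are collapsed, since a midpoint between two
    # identical values would not separate any objects.
    xs = sorted(set(values))
    # x_i + (x_{i+1} - x_i) / 2 is the midpoint of two neighbouring values.
    return [a + (b - a) / 2 for a, b in zip(xs, xs[1:])]


print(candidate_thresholds([3.0, 1.0, 2.0, 2.0]))  # [1.5, 2.5]
```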

Method 2:

Suppose X is a real-valued variable

  • Define IG(Y|X:t) as H(Y) - H(Y|X:t)

Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)

  • IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t

Then define IG^*(Y|X) = max_t IG(Y|X:t)
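The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: H is taken to be Shannon entropy in bits, the probabilities P(X < t) and P(X >= t) are estimated as sample fractions, and the maximum over t is taken over the midpoints from Method 1. The function names `entropy`, `info_gain` and `best_threshold` are made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(Y) in bits; 0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    """IG(Y|X:t) = H(Y) - [H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)]."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    cond = entropy(left) * len(left) / n + entropy(right) * len(right) / n
    return entropy(ys) - cond


def best_threshold(xs, ys):
    """Return (IG*, t*), maximising over midpoints of distinct values of X."""
    vals = sorted(set(xs))
    candidates = [a + (b - a) / 2 for a, b in zip(vals, vals[1:])]
    if not candidates:
        return 0.0, None  # X is constant, so no split is possible
    return max((info_gain(xs, ys, t), t) for t in candidates)


xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
gain, t = best_threshold(xs, ys)
print(gain, t)
```

On this toy data the threshold 2.5 separates the two classes perfectly, so it is the one selected.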

For each real-valued attribute, use IG*(Y|X) to assess its suitability as a split

Note: the tree may split on the same attribute multiple times, with different thresholds
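To make the last two points concrete, here is a hedged sketch of a tiny recursive tree builder. At each node it scores every feature by its best threshold and splits on the winner. The names `best_split` and `build`, the dict-based tree layout, and the stopping rule (stop when the node is pure or no split has positive gain) are assumptions made for this example, not something taken from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(rows, labels, j, t):
    """H(Y|X_j:t): weighted entropy of the two sides of the split x_j < t."""
    left = [y for r, y in zip(rows, labels) if r[j] < t]
    right = [y for r, y in zip(rows, labels) if r[j] >= t]
    n = len(labels)
    return entropy(left) * len(left) / n + entropy(right) * len(right) / n


def best_split(rows, labels):
    """Return (gain, feature index, threshold) with the largest gain, or None."""
    best = None
    for j in range(len(rows[0])):
        vals = sorted({r[j] for r in rows})
        # Midpoints between distinct values keep both sides non-empty.
        for a, b in zip(vals, vals[1:]):
            t = a + (b - a) / 2
            gain = entropy(labels) - split_entropy(rows, labels, j, t)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best


def build(rows, labels):
    """Recursively grow a tree; a leaf is the majority label at that node."""
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] >= t]
    return {
        "feature": j,
        "threshold": t,
        "left": build(*map(list, zip(*left))),
        "right": build(*map(list, zip(*right))),
    }


# One feature whose labels go a, a, b, b, a, a along the axis.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = ["a", "a", "b", "b", "a", "a"]
tree = build(rows, labels)
print(tree)
```

With this data the root splits feature 0 at 2.5 and the right subtree splits feature 0 again at 4.5, which is the "same attribute, different thresholds" case from the note.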
