How to deal with missing attribute values in C4.5 (J48) decision tree?
Question
What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification.
If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature?
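Yes: in Weka's ARFF format a missing value, whether nominal or numeric, is written as a bare `?`. A minimal sketch (the relation and attribute names here are just illustrative):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 85, no
?, 80, yes
rainy, ?, yes
```

J48 accepts such `?` entries directly during training, so no imputation is required beforehand.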
Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. During classification time, if I am trying to classify a new instance, what value do I put for features that have missing values? How would I descend the tree past a decision node for which I have an unknown value?
Would using Naive Bayes be better for handling missing values? I would just assign a very small non-zero probability for them, right?
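Naive Bayes does handle missing values gracefully: since the likelihood factorizes per attribute, a missing attribute can simply contribute no term at all, and Laplace (add-one) smoothing keeps *unseen* values from zeroing out a class score. A minimal sketch (all class and field names are illustrative assumptions, not Weka's API):

```java
import java.util.HashMap;
import java.util.Map;

class NaiveBayesSketch {
    Map<String, Double> priors = new HashMap<>();        // class -> P(class)
    // counts.get(cls).get(attr).get(value) -> training count
    Map<String, Map<String, Map<String, Integer>>> counts = new HashMap<>();
    Map<String, Integer> classTotals = new HashMap<>();  // class -> #training instances
    Map<String, Integer> attrArity = new HashMap<>();    // attr -> #distinct values

    // Unnormalized posterior score for one class.
    double score(String cls, Map<String, String> instance) {
        double s = priors.get(cls);
        for (Map.Entry<String, String> e : instance.entrySet()) {
            String attr = e.getKey(), val = e.getValue();
            if (val == null) continue;            // missing value: skip this factor
            int c = counts.get(cls).getOrDefault(attr, Map.of())
                          .getOrDefault(val, 0);
            int n = classTotals.get(cls);
            int k = attrArity.get(attr);
            s *= (c + 1.0) / (n + k);             // Laplace (add-one) smoothing
        }
        return s;
    }
}
```

So rather than assigning a tiny ad-hoc probability to the missing value itself, the usual treatment is to drop the factor for the missing attribute and reserve smoothing for values that were never observed in training.

```java
```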
Answer
From Pedro Domingos' ML course at the University of Washington, here are three approaches he suggests for handling a missing value of attribute A:

- Assign the most common value of A among the other examples sorted to node n.
- Assign the most common value of A among the other examples with the same target value.
- Assign probability p_i to each possible value v_i of A, and pass fraction p_i of the example down to each descendant in the tree.
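The third approach is the one C4.5 itself uses at classification time, and it answers the "how do I descend past an unknown value" question: send the instance down every branch, weight each branch's class distribution by the fraction of training instances that took it, and sum. A minimal sketch (the `Node` structure and field names are assumptions for illustration, not Weka's internals):

```java
import java.util.HashMap;
import java.util.Map;

class Node {
    String splitAttribute;                                  // null for a leaf
    Map<String, Node> children = new HashMap<>();           // attribute value -> subtree
    Map<String, Double> childWeights = new HashMap<>();     // training fraction per branch
    Map<String, Double> classDistribution = new HashMap<>();// leaf: class -> probability

    // Returns a class distribution for the instance; when the split value is
    // missing (or unseen), blend all children weighted by training fractions.
    Map<String, Double> distributionFor(Map<String, String> instance) {
        if (splitAttribute == null) {
            return classDistribution;                       // reached a leaf
        }
        String v = instance.get(splitAttribute);
        if (v != null && children.containsKey(v)) {
            return children.get(v).distributionFor(instance);
        }
        Map<String, Double> blended = new HashMap<>();
        for (Map.Entry<String, Node> e : children.entrySet()) {
            double w = childWeights.get(e.getKey());        // fraction p_i for this branch
            for (Map.Entry<String, Double> d :
                     e.getValue().distributionFor(instance).entrySet()) {
                blended.merge(d.getKey(), w * d.getValue(), Double::sum);
            }
        }
        return blended;
    }
}
```

The predicted class is then the argmax of the blended distribution, so a single missing value degrades the prediction gracefully instead of blocking the descent.

```java
```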
The slides and video are now viewable here.