Different decision tree algorithms with comparison of complexity or performance


Problem Description

I am doing research on data mining and more precisely, decision trees.

I would like to know if there are multiple algorithms to build a decision tree (or just one?), and which is better, based on criteria such as

  • Performance
  • Complexity
  • Errors in decision making
  • and more.

Solution

Decision Tree implementations differ primarily along these axes:

  • the splitting criterion (i.e., how "variance" is calculated; a short sketch of two common criteria follows this list)

  • whether it builds models for regression (continuous variables, e.g., a score) as well as classification (discrete variables, e.g., a class label)

  • technique to eliminate/reduce over-fitting

  • whether it can handle incomplete data
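
As a rough sketch of what a splitting criterion actually computes, here are two common impurity measures: Gini impurity (used by CART) and Shannon entropy (the basis of ID3/C4.5's information gain). This is illustrative only; the function names are mine, not any library's API.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array(["a", "a", "b", "b", "b", "c"])
print(gini(y))     # ~0.611 -- a split is good if it lowers this
print(entropy(y))  # ~1.459 -- information gain is the drop in entropy
```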


The major Decision Tree implementations are:

  • ID3, or Iterative Dichotomizer, was the first of three Decision Tree implementations developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.)

  • CART, or Classification And Regression Trees, is often used as a generic acronym for the term Decision Tree, though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data (see the sketch after this list), whereas C4.5 includes the intermediate step of constructing rule sets.

  • C4.5, Quinlan's next iteration. The new features (versus ID3) are: (i) accepts both continuous and discrete features; (ii) handles incomplete data points; (iii) solves the over-fitting problem with a (very clever) bottom-up technique usually known as "pruning"; and (iv) different weights can be applied to the features that comprise the training data. Of these, the first three are very important--and I would suggest that any DT implementation you choose has all three. The fourth (differential weighting) is much less important.

  • C5.0, the most recent Quinlan iteration. This implementation is covered by a patent and probably, as a result, is rarely implemented (outside of commercial software packages). I have never coded a C5.0 implementation myself (I have never even seen the source code) so I can't offer an informed comparison of C5.0 versus C4.5. I have always been skeptical about the improvements claimed by its inventor (Ross Quinlan)--for instance, he claims it is "several orders of magnitude" faster than C4.5. Other claims are similarly broad ("significantly more memory efficient") and so forth. I'll just point you to studies which report the results of comparing the two techniques, so you can decide for yourself.

  • CHAID (chi-square automatic interaction detector) actually predates the original ID3 implementation by about six years (published in a Ph.D. thesis by Gordon Kass in 1980). I know very little about this technique. The R platform has a package called CHAID, which includes excellent documentation.

  • MARS (multi-adaptive regression splines) is actually a term trademarked by the original inventor of MARS, Salford Systems. As a result, MARS clones in libraries not sold by Salford are named something other than MARS--e.g., in R, the relevant function is polymars in the polspline library. Matlab and Statistica also have implementations with MARS functionality.
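
To make the CART entry's "numerical splitting criterion recursively applied to the data" concrete, here is a minimal sketch of greedy CART-style tree growing, under stated assumptions (integer-coded class labels, exhaustive threshold search, no pruning). It shows the split-then-recurse shape, not any library's actual implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer-coded label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) pair that
    minimizes the weighted Gini impurity of the two children."""
    best_j, best_t, best_score = None, None, np.inf
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # everything lands on one side: useless split
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t

def build_tree(X, y, depth=0, max_depth=3):
    """Recurse until the node is pure, the depth budget is spent,
    or no split separates the data."""
    if depth == max_depth or gini(y) == 0.0:
        return {"leaf": int(np.bincount(y).argmax())}  # majority class
    j, t = best_split(X, y)
    if j is None:
        return {"leaf": int(np.bincount(y).argmax())}
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left":  build_tree(X[mask],  y[mask],  depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}
```

Real implementations add pruning, missing-value handling, and much smarter threshold search; the point here is only the recursive structure.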

I would recommend CART or C4.5 (though again, while I am familiar with their feature sets, I have no direct experience with C5.0 or with CHAID).

C4.5 is the Decision Tree flavor implemented in Orange; CART is the flavor in sklearn--both excellent implementations in excellent ML libraries.
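
For what it's worth, the sklearn side of this is nearly a one-liner. A minimal sketch, using the built-in iris data purely as a stand-in for real training data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# sklearn's tree is CART-style; criterion="entropy" swaps in an
# information-gain measure closer in spirit to ID3/C4.5.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```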

C4.5 is a major step beyond ID3--both in terms of range (C4.5 has a far broader use case spectrum because it can handle continuous variables in the training data) and in terms of model quality.

Perhaps the most significant claimed improvement of C5.0 versus C4.5 is support for boosted trees. Ensemble support for DTs--boosted trees and Random Forests--has been included in the DT implementation in Orange; here, ensemble support was added to a C4.5 algorithm. sklearn also features a range of random forest and boosting methods.
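
A minimal sketch of the sklearn ensemble methods mentioned above (again illustrative, with the iris data standing in for real training data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagged trees with feature subsampling (Random Forest) ...
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# ... versus sequentially boosted shallow trees.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("random forest", rf), ("boosted trees", gb)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```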

