Classification tree in sklearn giving inconsistent answers


Problem description


I am using a classification tree from sklearn, and when I train the model twice on the same data and predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and it worked as expected. Here is some code:

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# Fit twice on the same data and compare predictions
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)

clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)


r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?


EDIT: After looking into some documentation, I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant, I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended way to handle this? Try a few seeds at random? Or are all results expected to be about the same?
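For reference, pinning the seed on the example above looks like this (a minimal sketch, reusing the iris data from the code block earlier):

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# With a fixed random_state, repeated fits produce the same tree
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)

clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
# r1 and r2 now match on every run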

Answer


The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
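For instance, the two splitters are selected like this ("best" is the default and is spelled out here only for contrast; a minimal sketch):

from sklearn.tree import DecisionTreeClassifier

# "best": search for the best threshold on each candidate feature
clf_best = DecisionTreeClassifier(splitter="best")

# "random": try a randomly drawn threshold on each candidate feature
clf_random = DecisionTreeClassifier(splitter="random")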



  • "best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.


"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.


Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.


If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
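If you do want to compare seeds empirically, one option is cross-validation over a handful of random_state values (a sketch using the iris data again; in practice the mean scores should come out close to one another):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Score a few seeds; large differences would be surprising
for seed in (0, 1, 2, 3, 4):
    clf = DecisionTreeClassifier(random_state=seed)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(seed, scores.mean())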

See also: the source code for the splitters.
