Classification tree in sklearn giving inconsistent answers
Question
I am using a classification tree from sklearn, and when I train the model twice on the same data and predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and it worked as expected. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?
EDIT: After looking into some documentation I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended method for doing this? Try some seeds at random? Or are all results expected to be about the same?
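For reference, the determinism from fixing random_state can be checked directly; this is a minimal sketch of the original experiment with a fixed seed (the comparison via np.array_equal is my own addition):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# With a fixed random_state, repeated fits on the same data are deterministic.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)

print(np.array_equal(r1, r2))  # True: same seed, same tree
```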
Answer
The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
- "best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the method's signature and the criterion argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
- "random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that they lie between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
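Both strategies can be selected directly via the splitter keyword; here is a quick sketch on the iris data (the depth comparison is just one illustrative way to see that the two strategies grow different trees):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# splitter="best": search for the best threshold on each candidate feature.
best_clf = tree.DecisionTreeClassifier(splitter="best", random_state=0)
best_clf.fit(iris.data, iris.target)

# splitter="random": draw random thresholds between each feature's
# minimum and maximum, then keep the best of those draws.
rand_clf = tree.DecisionTreeClassifier(splitter="random", random_state=0)
rand_clf.fit(iris.data, iris.target)

# The random splitter tends to produce a deeper, less finely-tuned tree.
print(best_clf.get_depth(), rand_clf.get_depth())
```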
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Lui, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
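One way to sanity-check that the trees really are "more or less equivalent" is to cross-validate over a handful of seeds and look at the spread of the scores. This is my own sketch, not part of the original answer:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Cross-validated accuracy for several seeds: if the spread is small,
# the particular choice of random_state matters little in practice.
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=seed),
                    iris.data, iris.target, cv=5).mean()
    for seed in range(5)
]
print(float(np.mean(scores)), float(np.std(scores)))
```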
See also: the source code for the splitters