Why does Decision Tree code written in Python predict differently than the code written in R?


Problem description

I am working with the load_iris data set from sklearn in Python and R (it's just called iris in R).

我使用"gini"索引使用两种语言构建了模型,并且当直接从虹膜数据集中获取测试数据时,我能够正确地测试模型.

I built the model in both languages using the "gini" index, and in both languages I am able to test the model properly when the test data is taken directly from the iris data set.

However, if I give a new data set as test input, Python and R put the same input into different categories.

I'm not sure what I'm missing here or doing wrong, so any guidance will be very much appreciated.

The code is below. Python 2.7:

from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(iris.data, iris.target)
model.score(iris.data, iris.target)
print iris.data[49],model.predict([iris.data[49]])
print iris.data[99],model.predict([iris.data[99]])
print iris.data[100],model.predict([iris.data[100]])
print iris.data[149],model.predict([iris.data[149]])
print [6.3,2.8,6,1.3],model.predict([[6.3,2.8,6,1.3]])

R (RStudio) running 3.3.2, 32-bit:

library(rpart)
iris<- iris
x_train = iris[c('Sepal.Length','Sepal.Width','Petal.Length','Petal.Width')]
y_train = as.matrix(cbind(iris['Species']))
x <- cbind(x_train,y_train)
fit <- rpart(y_train ~ ., data = x_train,method="class",parms = list(split = "gini"))
summary(fit)
x_test = x[149,]
x_test[,1]=6.3
x_test[,2]=2.8
x_test[,3]=6
x_test[,4]=1.3
predicted1= predict(fit,x[49,]) # same as python result
predicted2= predict(fit,x[100,]) # same as python result 
predicted3= predict(fit,x[101,]) # same as python result
predicted4= predict(fit,x[149,]) # same as python result
predicted5= predict(fit,x_test) ## this value does not match with pythons result

My Python output is:

[ 5.   3.3  1.4  0.2] [0]
[ 5.7  2.8  4.1  1.3] [1]
[ 6.3  3.3  6.   2.5] [2]
[ 5.9  3.   5.1  1.8] [2]
[6.3, 2.8, 6, 1.3] [2] -----> this means it's putting the test data into virginica bucket

The R output is:

> predicted1
   setosa versicolor virginica
49      1          0         0
> predicted2
    setosa versicolor  virginica
100      0  0.9074074 0.09259259
> predicted3
    setosa versicolor virginica
101      0 0.02173913 0.9782609
> predicted4
    setosa versicolor virginica
149      0 0.02173913 0.9782609
> predicted5
    setosa versicolor  virginica
149      0  0.9074074 0.09259259 --> this means it's putting the test data into versicolor bucket

Please help. Thanks.

Answer

Decision trees involve quite a few parameters (min/max leaf size, depth of tree, when to split, etc.), and different packages may have different default settings. If you want to get the same results, you need to make sure the implicit defaults are similar. For instance, try running the following:

fit <- rpart(y_train ~ ., data = x_train,method="class",
             parms = list(split = "gini"), 
             control = rpart.control(minsplit = 2, minbucket = 1, xval=0, maxdepth = 30))

(predicted5= predict(fit,x_test))
    setosa versicolor virginica
149      0  0.3333333 0.6666667

Here, the options minsplit = 2, minbucket = 1, xval = 0 and maxdepth = 30 are chosen so as to be identical to the sklearn options, see here. (maxdepth = 30 is the largest value rpart will let you have; sklearn has no bound here.) If you want the probabilities etc. to be identical, you probably want to play around with the cp parameter as well.
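For reference, sklearn's implicit defaults can be written out explicitly; this sketch annotates each one with the rpart.control setting it corresponds to (parameter names as in the sklearn API):

```python
from sklearn.tree import DecisionTreeClassifier

# sklearn's defaults, spelled out: the tree grows until every leaf is pure.
model = DecisionTreeClassifier(criterion='gini',
                               min_samples_split=2,  # rpart: minsplit (default 20)
                               min_samples_leaf=1,   # rpart: minbucket (default minsplit/3)
                               max_depth=None)       # rpart caps maxdepth at 30

print(model.get_params()['min_samples_split'])
```

With these defaults an unconstrained tree will keep splitting down to single-sample leaves, which is why the unmatched models disagree on borderline points like [6.3, 2.8, 6, 1.3].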

Similarly, with

model = tree.DecisionTreeClassifier(criterion='gini', 
                                    min_samples_split=20, 
                                    min_samples_leaf=round(20.0/3.0), max_depth=30)
model.fit(iris.data, iris.target)

I get

print model.predict([iris.data[49]])
print model.predict([iris.data[99]])
print model.predict([iris.data[100]])
print model.predict([iris.data[149]])
print model.predict([[6.3,2.8,6,1.3]])

[0]
[1]
[2]
[2]
[1]

which looks pretty similar to your initial R output.

Needless to say, be careful when your predictions (on the training set) seem "unreasonably good", as you are likely to overfit the data. For instance, have a look at model.predict_proba(...), which gives you the probabilities in sklearn (instead of the predicted classes). You should see that with your current Python code / settings, you are almost surely overfitting.
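One way to make the overfitting visible (not part of the original answer; a sketch assuming a reasonably recent sklearn with the model_selection module) is to compare training accuracy against cross-validated accuracy:

```python
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import cross_val_score

iris = load_iris()

# An unconstrained tree memorises the training set ...
model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(iris.data, iris.target)
print(model.score(iris.data, iris.target))  # perfect accuracy on the data it was fit on

# ... while cross-validated accuracy gives a fairer estimate of generalisation.
scores = cross_val_score(tree.DecisionTreeClassifier(criterion='gini'),
                         iris.data, iris.target, cv=5)
print(scores.mean())
```

The gap between the training score and the cross-validated mean is the telltale sign that the default settings are fitting noise rather than structure.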
