评估sklearn cross_val_score的多个分数 [英] Evaluate multiple scores on sklearn cross_val_score

查看:60
本文介绍了评估sklearn cross_val_score的多个分数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用sklearn评估多种机器学习算法,以评估几个指标(准确性,召回率,精度,甚至更多).

根据我在文档此处和源代码的理解(我使用的是sklearn 0.17), cross_val_score 函数每次执行仅接收一个计分器.因此,要计算多个分数,我必须:

  1. 多次执行
  2. 实施我的(耗时且容易出错的)计分器

    我已经用此代码执行了多次:

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cross_validation import  cross_val_score
    import time
    from sklearn.datasets import  load_iris
    
    iris = load_iris()
    
    models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
    names = ["Naive Bayes", "Decision Tree", "SVM"]
    for model, name in zip(models, names):
        print name
        start = time.time()
        for score in ["accuracy", "precision", "recall"]:
            print score,
            print " : ",
            print cross_val_score(model, iris.data, iris.target,scoring=score, cv=10).mean()
        print time.time() - start
    

我得到以下输出:

Naive Bayes
accuracy  :  0.953333333333
precision  :  0.962698412698
recall  :  0.953333333333
0.0383198261261
Decision Tree
accuracy  :  0.953333333333
precision  :  0.958888888889
recall  :  0.953333333333
0.0494720935822
SVM
accuracy  :  0.98
precision  :  0.983333333333
recall  :  0.98
0.063080072403

还可以,但是对于我自己的数据来说很慢.如何测量所有分数?

解决方案

自撰写本文后,scikit-learn已更新并且使我的答案过时了,请参阅下面的更简洁的解决方案


您可以编写自己的评分函数来捕获所有三条信息,但是用于交叉验证的评分函数只能在scikit-learn中返回一个数字(这可能是出于兼容性原因).下面是一个示例,其中每个交叉验证切片的每个分数都会打印到控制台,并且返回的值只是这三个指标的总和.如果要返回所有这些值,则必须对cross_val_score(cross_validation.py的1351行)和_score(1601或同一文件)进行一些更改.

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import  cross_val_score
import time
from sklearn.datasets import  load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score

iris = load_iris()

models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
names = ["Naive Bayes", "Decision Tree", "SVM"]

def getScores(estimator, x, y):
    yPred = estimator.predict(x)
    return (accuracy_score(y, yPred), 
            precision_score(y, yPred, pos_label=3, average='macro'), 
            recall_score(y, yPred, pos_label=3, average='macro'))

def my_scorer(estimator, x, y):
    a, p, r = getScores(estimator, x, y)
    print a, p, r
    return a+p+r

for model, name in zip(models, names):
    print name
    start = time.time()
    m = cross_val_score(model, iris.data, iris.target,scoring=my_scorer, cv=10).mean()
    print '\nSum:',m, '\n\n'
    print 'time', time.time() - start, '\n\n'

哪个给了:

Naive Bayes
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.904761904762 0.866666666667
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86936507937 


time 0.0249638557434 


Decision Tree
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.866666666667 0.866666666667
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86555555556 


time 0.0237860679626 


SVM
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.94333333333 


time 0.043044090271 


从scikit-learn 0.19.0开始,该解决方案变得容易得多

from sklearn.model_selection import cross_validate
from sklearn.datasets import  load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_micro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                         cv=5, return_train_score=True)
print(scores.keys())
print(scores['test_acc'])  

哪个给:

['test_acc', 'score_time', 'train_acc', 'fit_time', 'test_rec_micro', 'train_rec_micro', 'train_prec_macro', 'test_prec_macro']
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]

I'm trying to evaluate multiple machine learning algorithms with sklearn for a couple of metrics (accuracy, recall, precision and maybe more).

For what I understood from the documentation here and from the source code(I'm using sklearn 0.17), the cross_val_score function only receives one scorer for each execution. So for calculating multiple scores, I have to :

  1. Execute multiple times
  2. Implement my (time consuming and error prone) scorer

    I've executed multiple times with this code :

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cross_validation import  cross_val_score
    import time
    from sklearn.datasets import  load_iris
    
    iris = load_iris()
    
    models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
    names = ["Naive Bayes", "Decision Tree", "SVM"]
    for model, name in zip(models, names):
        print name
        start = time.time()
        for score in ["accuracy", "precision", "recall"]:
            print score,
            print " : ",
            print cross_val_score(model, iris.data, iris.target,scoring=score, cv=10).mean()
        print time.time() - start
    

And I get this output:

Naive Bayes
accuracy  :  0.953333333333
precision  :  0.962698412698
recall  :  0.953333333333
0.0383198261261
Decision Tree
accuracy  :  0.953333333333
precision  :  0.958888888889
recall  :  0.953333333333
0.0494720935822
SVM
accuracy  :  0.98
precision  :  0.983333333333
recall  :  0.98
0.063080072403

Which is ok, but it's slow for my own data. How can I measure all scores ?

解决方案

Since the time of writing this post scikit-learn has updated and made my answer obsolete, see the much cleaner solution below


You can write your own scoring function to capture all three pieces of information, however a scoring function for cross validation must only return a single number in scikit-learn (this is likely for compatibility reasons). Below is an example where each of the scores for each cross validation slice prints to the console, and the returned value is just the sum of the three metrics. If you want to return all these values, you're going to have to make some changes to cross_val_score (line 1351 of cross_validation.py) and _score (line 1601 or the same file).

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import  cross_val_score
import time
from sklearn.datasets import  load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score

iris = load_iris()

models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
names = ["Naive Bayes", "Decision Tree", "SVM"]

def getScores(estimator, x, y):
    yPred = estimator.predict(x)
    return (accuracy_score(y, yPred), 
            precision_score(y, yPred, pos_label=3, average='macro'), 
            recall_score(y, yPred, pos_label=3, average='macro'))

def my_scorer(estimator, x, y):
    a, p, r = getScores(estimator, x, y)
    print a, p, r
    return a+p+r

for model, name in zip(models, names):
    print name
    start = time.time()
    m = cross_val_score(model, iris.data, iris.target,scoring=my_scorer, cv=10).mean()
    print '\nSum:',m, '\n\n'
    print 'time', time.time() - start, '\n\n'

Which gives:

Naive Bayes
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.904761904762 0.866666666667
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86936507937 


time 0.0249638557434 


Decision Tree
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.866666666667 0.866666666667
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86555555556 


time 0.0237860679626 


SVM
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.94333333333 


time 0.043044090271 


As of scikit-learn 0.19.0 the solution becomes much easier

from sklearn.model_selection import cross_validate
from sklearn.datasets import  load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_micro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                         cv=5, return_train_score=True)
print(scores.keys())
print(scores['test_acc'])  

Which gives:

['test_acc', 'score_time', 'train_acc', 'fit_time', 'test_rec_micro', 'train_rec_micro', 'train_prec_macro', 'test_prec_macro']
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]

这篇关于评估sklearn cross_val_score的多个分数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆