sklearn GridSearchCV (Scoring Function error)


Problem description

I was wondering if you could help me out with an error I am receiving when running grid search. I think it might be due to a misunderstanding of how grid search actually works.

I am now running an application where I need grid search to evaluate the best parameters using a different scoring function. I am using RandomForestClassifier to fit a large X dataset to a characterization vector Y, which is a list of 0s and 1s (completely binary). My scoring function (MCC) requires the prediction input and actual input to be completely binary. However, for some reason I keep getting `ValueError: multiclass is not supported`.

My understanding is that grid search does cross-validation on the data set, comes up with a prediction based on the cross-validation, then passes the characterization vector and the prediction into the scoring function. Since my characterization vector is completely binary, my prediction vector should also be binary and cause no problem when evaluating the score. When I run random forest with a single defined parameter (without using grid search), inserting the predicted data and characterization vector into the MCC scoring function runs perfectly fine. So I am a little lost as to how running the grid search would cause any errors.
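As a diagnostic (not part of the original question), scikit-learn's `type_of_target` helper shows how the library classifies a label vector; `matthews_corrcoef` raises `multiclass is not supported` whenever that classification comes out as anything other than `"binary"`:

```python
from sklearn.utils.multiclass import type_of_target

# matthews_corrcoef rejects any target whose inferred type is not "binary",
# so checking both Y and the predictions narrows down where the error comes from.
print(type_of_target([0.0, 1.0, 1.0, 0.0]))  # two float-valued labels
print(type_of_target([0, 1, 2]))             # three distinct classes
```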

Data snapshot:

        print len(X)
        print X[0]
        print len(Y)
        print Y[2990:3000]
17463699
[38.110903683955435, 38.110903683955435, 38.110903683955435, 9.899495124816895, 294.7808837890625, 292.3835754394531, 293.81494140625, 291.11065673828125, 293.51739501953125, 283.6424865722656, 13.580912590026855, 4.976086616516113, 1.1271398067474365, 0.9465181231498718, 0.5066819190979004, 0.1808401197195053, 0.0]
17463699
[0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Code:

from sklearn import grid_search
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, make_scorer, matthews_corrcoef,
                             precision_recall_fscore_support)

def overall_average_score(actual, prediction):
    # One call returns precision, recall and F1 together
    precision, recall, f1_score, _ = precision_recall_fscore_support(
        actual, prediction, average='binary')
    total_score = (matthews_corrcoef(actual, prediction)
                   + accuracy_score(actual, prediction)
                   + precision + recall + f1_score)
    return total_score / 5

grid_scorer = make_scorer(overall_average_score, greater_is_better=True)
parameters = {'n_estimators': [10, 20, 30],
              'max_features': ['auto', 'sqrt', 'log2', 0.5, 0.3]}
random = RandomForestClassifier()
clf = grid_search.GridSearchCV(random, parameters, cv=5, scoring=grid_scorer)
clf.fit(X, Y)

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-39-a8686eb798b2> in <module>()
     18 random  = RandomForestClassifier()
     19 clf = grid_search.GridSearchCV(random, parameters, cv = 5, scoring = grid_scorer)
---> 20 clf.fit(X,Y)
     21 
     22 

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    730 
    731         """
--> 732         return self._fit(X, y, ParameterGrid(self.param_grid))
    733 
    734 

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
    503                                     self.fit_params, return_parameters=True,
    504                                     error_score=self.error_score)
--> 505                 for parameters in parameter_iterable
    506                 for train, test in cv)
    507 

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    657             self._iterating = True
    658             for function, args, kwargs in iterable:
--> 659                 self.dispatch(function, args, kwargs)
    660 
    661             if pre_dispatch == "all" or n_jobs == 1:

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
    404         """
    405         if self._pool is None:
--> 406             job = ImmediateApply(func, args, kwargs)
    407             index = len(self._jobs)
    408             if not _verbosity_filter(index, self.verbose):

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
    138         # Don't delay the application, to avoid keeping the input
    139         # arguments in memory
--> 140         self.results = func(*args, **kwargs)
    141 
    142     def get(self):

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1476 
   1477     else:
-> 1478         test_score = _score(estimator, X_test, y_test, scorer)
   1479         if return_train_score:
   1480             train_score = _score(estimator, X_train, y_train, scorer)

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
   1532         score = scorer(estimator, X_test)
   1533     else:
-> 1534         score = scorer(estimator, X_test, y_test)
   1535     if not isinstance(score, numbers.Number):
   1536         raise ValueError("scoring must return a number, got %s (%s) instead."

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/metrics/scorer.pyc in __call__(self, estimator, X, y_true, sample_weight)
     87         else:
     88             return self._sign * self._score_func(y_true, y_pred,
---> 89                                                  **self._kwargs)
     90 
     91 

<ipython-input-39-a8686eb798b2> in overall_average_score(actual, prediction)
      3     recall = precision_recall_fscore_support(actual, prediction, average = 'binary')[1]
      4     f1_score = precision_recall_fscore_support(actual, prediction, average = 'binary')[2]
----> 5     total_score = matthews_corrcoef(actual, prediction)+accuracy_score(actual, prediction)+precision+recall+f1_score
      6     return total_score/5
      7 def show_score(actual,prediction):

/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/metrics/classification.pyc in matthews_corrcoef(y_true, y_pred)
    395 
    396     if y_type != "binary":
--> 397         raise ValueError("%s is not supported" % y_type)
    398 
    399     lb = LabelEncoder()

ValueError: multiclass is not supported

Answer

The Matthews Correlation Coefficient is a score between -1 and 1. So it is not correct to average f1_score, precision, recall, accuracy_score and MCC directly.

MCC values indicate:

- 1 is total positive correlation
- 0 is no correlation
- −1 is total negative correlation
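To make that range concrete, MCC can be computed directly from confusion-matrix counts (a minimal sketch; the `mcc` helper below is our own, not from the original answer):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from confusion-matrix counts
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(5, 5, 0, 0))   # perfect prediction  -> 1.0
print(mcc(0, 0, 5, 5))   # total disagreement -> -1.0
```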

The other evaluation metrics mentioned above, by contrast, lie between 0 and 1 (from worst to best). The ranges and the meanings are not the same.
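If a single averaged score is still desired, one option (our suggestion, not part of the original answer) is to first map MCC from [-1, 1] onto [0, 1] so that all five metrics share the same scale and direction:

```python
def overall_average_rescaled(mcc, accuracy, precision, recall, f1):
    # Rescale MCC from [-1, 1] to [0, 1] before averaging, so every
    # term runs from worst (0) to best (1) like the other four metrics.
    mcc_01 = (mcc + 1.0) / 2.0
    return (mcc_01 + accuracy + precision + recall + f1) / 5.0
```

With this rescaling, a perfect classifier scores 1.0 and a maximally wrong one scores 0.0, which is what a simple average of [0, 1] metrics implies.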

