sklearn GridSearchCV (Scoring Function error)
Problem description
I was wondering if you could help me with an error I get when running a grid search. I think it may be due to a misunderstanding of how grid search actually works.
I am working on an application where I need a grid search to find the best parameters using a custom scoring function. I am using RandomForestClassifier to fit a large dataset X to a characterization vector Y, which is a list of 0s and 1s (completely binary). My scoring function (MCC) requires both the predicted and the actual labels to be completely binary. However, for some reason I keep getting ValueError: multiclass is not supported.
My understanding is that grid search performs cross-validation on the dataset, produces predictions for each cross-validation split, and then passes the characterization vector and the predictions to the scoring function. Since my characterization vector is completely binary, my prediction vector should be binary as well and should cause no problem when the score is evaluated. When I run the random forest with a single fixed set of parameters (without grid search), passing the predicted data and the characterization vector to the MCC scoring function works perfectly fine. So I am a little lost as to how running the grid search could cause this error.
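As a rough illustration of the per-fold flow described above, the sketch below fits a model on each training split and scores it on the held-out split, once through a make_scorer object and once by calling the metric directly. It assumes a recent scikit-learn, where cross-validation utilities live in sklearn.model_selection rather than the older sklearn.grid_search module used in this question. X_demo and y_demo are small synthetic stand-ins, not the asker's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data (an assumption for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = (X_demo[:, 0] > 0.5).astype(float)  # float labels 0.0 / 1.0, as in the question

scorer = make_scorer(matthews_corrcoef, greater_is_better=True)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # The scorer calls model.predict on the held-out rows and passes
    # (y_true, y_pred) to the wrapped metric.
    via_scorer = scorer(model, X_demo[test_idx], y_demo[test_idx])
    # The same computation written out by hand.
    by_hand = matthews_corrcoef(y_demo[test_idx], model.predict(X_demo[test_idx]))
    fold_scores.append((via_scorer, by_hand))
```

Each pair in fold_scores should agree, which suggests that the metric only ever sees the true labels and the predictions of one held-out split at a time.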
Data snapshot:
print len(X)
print X[0]
print len(Y)
print Y[2990:3000]
17463699
[38.110903683955435, 38.110903683955435, 38.110903683955435, 9.899495124816895, 294.7808837890625, 292.3835754394531, 293.81494140625, 291.11065673828125, 293.51739501953125, 283.6424865722656, 13.580912590026855, 4.976086616516113, 1.1271398067474365, 0.9465181231498718, 0.5066819190979004, 0.1808401197195053, 0.0]
17463699
[0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Code:
# Imports for the legacy sklearn.grid_search module shown in the traceback.
from sklearn import grid_search
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, make_scorer, matthews_corrcoef,
                             precision_recall_fscore_support)

def overall_average_score(actual,prediction):
    # Unweighted mean of MCC, accuracy, precision, recall and F1.
    precision = precision_recall_fscore_support(actual, prediction, average = 'binary')[0]
    recall = precision_recall_fscore_support(actual, prediction, average = 'binary')[1]
    f1_score = precision_recall_fscore_support(actual, prediction, average = 'binary')[2]
    total_score = matthews_corrcoef(actual, prediction)+accuracy_score(actual, prediction)+precision+recall+f1_score
    return total_score/5

grid_scorer = make_scorer(overall_average_score, greater_is_better=True)
parameters = {'n_estimators': [10,20,30], 'max_features': ['auto','sqrt','log2',0.5,0.3], }
random = RandomForestClassifier()
clf = grid_search.GridSearchCV(random, parameters, cv = 5, scoring = grid_scorer)
clf.fit(X,Y)
Error:
ValueError Traceback (most recent call last)
<ipython-input-39-a8686eb798b2> in <module>()
18 random = RandomForestClassifier()
19 clf = grid_search.GridSearchCV(random, parameters, cv = 5, scoring = grid_scorer)
---> 20 clf.fit(X,Y)
21
22
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
730
731 """
--> 732 return self._fit(X, y, ParameterGrid(self.param_grid))
733
734
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
503 self.fit_params, return_parameters=True,
504 error_score=self.error_score)
--> 505 for parameters in parameter_iterable
506 for train, test in cv)
507
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
657 self._iterating = True
658 for function, args, kwargs in iterable:
--> 659 self.dispatch(function, args, kwargs)
660
661 if pre_dispatch == "all" or n_jobs == 1:
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
404 """
405 if self._pool is None:
--> 406 job = ImmediateApply(func, args, kwargs)
407 index = len(self._jobs)
408 if not _verbosity_filter(index, self.verbose):
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
138 # Don't delay the application, to avoid keeping the input
139 # arguments in memory
--> 140 self.results = func(*args, **kwargs)
141
142 def get(self):
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1476
1477 else:
-> 1478 test_score = _score(estimator, X_test, y_test, scorer)
1479 if return_train_score:
1480 train_score = _score(estimator, X_train, y_train, scorer)
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
1532 score = scorer(estimator, X_test)
1533 else:
-> 1534 score = scorer(estimator, X_test, y_test)
1535 if not isinstance(score, numbers.Number):
1536 raise ValueError("scoring must return a number, got %s (%s) instead."
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/metrics/scorer.pyc in __call__(self, estimator, X, y_true, sample_weight)
87 else:
88 return self._sign * self._score_func(y_true, y_pred,
---> 89 **self._kwargs)
90
91
<ipython-input-39-a8686eb798b2> in overall_average_score(actual, prediction)
3 recall = precision_recall_fscore_support(actual, prediction, average = 'binary')[1]
4 f1_score = precision_recall_fscore_support(actual, prediction, average = 'binary')[2]
----> 5 total_score = matthews_corrcoef(actual, prediction)+accuracy_score(actual, prediction)+precision+recall+f1_score
6 return total_score/5
7 def show_score(actual,prediction):
/shared/studies/nonregulated/neurostream/neurostream/local/lib/python2.7/site-packages/sklearn/metrics/classification.pyc in matthews_corrcoef(y_true, y_pred)
395
396 if y_type != "binary":
--> 397 raise ValueError("%s is not supported" % y_type)
398
399 lb = LabelEncoder()
ValueError: multiclass is not supported
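The last frame of the traceback shows the error being raised because the inferred target type is not "binary". One way to investigate this, sketched below, is to ask scikit-learn how it classifies a label array using sklearn.utils.multiclass.type_of_target. The arrays here are hypothetical examples, not the asker's data. Whether a stray third label value is the actual cause in this question is an assumption that the source does not confirm.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Labels like the snapshot in the question: only 0.0 and 1.0.
y_binary = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# Hypothetical array with one unexpected third value.
y_stray = np.array([0.0, 1.0, 2.0, 1.0, 0.0])

kind_binary = type_of_target(y_binary)   # inferred as 'binary'
kind_stray = type_of_target(y_stray)     # inferred as 'multiclass'

# Listing the distinct values is a quick way to spot unexpected labels.
distinct_stray = np.unique(y_stray)
```

Running the same check on the full Y, for example np.unique(Y), would show whether any value other than 0 and 1 is present.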
Recommended answer
The Matthews correlation coefficient (MCC) is a score between -1 and 1, so it is not correct to average it with f1_score, precision, recall, and accuracy_score.
MCC values indicate:
1 is total positive correlation
0 is no correlation
-1 is total negative correlation
The other evaluation metrics mentioned above range from 0 to 1 (worst to best), so their range and meaning are not the same as those of MCC.
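To make the range mismatch concrete, the sketch below computes MCC for a fully inverted prediction and shows one possible workaround: mapping MCC linearly from [-1, 1] onto [0, 1] before averaging. The rescaling and the function name rescaled_average_score are assumptions introduced for illustration. They are not part of the answer, and whether such an average is meaningful for a given problem is a separate judgment.

```python
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_recall_fscore_support)

def rescaled_average_score(actual, prediction):
    """Mean of five metrics after mapping MCC from [-1, 1] onto [0, 1]."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        actual, prediction, average='binary')
    mcc = matthews_corrcoef(actual, prediction)
    mcc_01 = (mcc + 1.0) / 2.0  # hypothetical rescaling, not from the answer
    accuracy = accuracy_score(actual, prediction)
    return (mcc_01 + accuracy + precision + recall + f1) / 5.0

y_true = [0, 0, 1, 1]
inverted = [1, 1, 0, 0]  # every prediction is wrong
perfect = [0, 0, 1, 1]   # every prediction is right

mcc_inverted = matthews_corrcoef(y_true, inverted)  # lower end of the MCC range
score_inverted = rescaled_average_score(y_true, inverted)
score_perfect = rescaled_average_score(y_true, perfect)
```

With the rescaling, all five terms share the same 0 to 1 range, so the worst case contributes 0 rather than -1 to the average.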