Improving the prediction score by use of confidence level of classifiers on instances

Problem Description

I am using three classifiers (RandomForestClassifier, KNeighborsClassifier, and SVC), which you can see below:

>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=41, shrinking=True,
  tol=0.001, verbose=False)

>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')

>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

During training, RandomForestClassifier gives the best f1_score on its predictions, followed by KNeighborsClassifier and then SVC. Here is my X_train (standard-scaled values; if needed, you can ask how I got this) and y_train:

>> X_train
array([[-0.11034393, -0.72380296,  0.15254572, ...,  0.4166148 ,
        -0.91095473, -0.91095295],
       [ 1.6817184 ,  0.40040944, -0.6770607 , ..., -0.2403781 ,
         0.02962478,  0.02962424],
       [ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
        -0.15848331, -0.15847739],
       ..., 
       [-1.18666853,  0.87297522,  0.47136779, ..., -0.19599824,
         0.72417473,  0.72416714],
       [ 1.6835304 ,  0.40605067, -0.63383059, ..., -0.37094083,
         0.09505496,  0.09505389],
       [ 0.19950709, -1.04624152, -0.18351693, ...,  0.4362658 ,
        -0.77994791, -0.77994176]])

>> y_train_sl
874     0
1863    0
1493    0
288     1
260     0
495     0
1529    0
1704    1
75      1
1792    0
626     0
99      1
222     0
774     0
52      1
1688    1
1770    0
53      1
1814    0
488     0
230     0
481     0
132     1
831     0
1166    1
1593    0
771     0
1785    0
616     0
207     0
       ..
155     1
1506    0
719     0
547     0
613     0
652     0
1351    0
304     0
1689    1
1693    1
1128    0
1323    0
763     0
701     0
467     0
917     0
329     0
375     0
1721    0
928     0
1784    0
1200    0
832     0
986     0
1687    1
643     0
802     0
280     1
1864    0
1045    0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8

As you can see, my y_train is in Boolean form (i.e., it marks which instances are True and which are False).

I want to improve the accuracy of the predictions further by using predict_proba, as follows. When the first classifier (say RandomForestClassifier) has a low confidence level (<60%) about a particular instance it predicted (finding those instances is the first step), move to the next classifier (say KNeighborsClassifier) and check its confidence level on those same instances. If it is more confident (>60%) than the previous classifier, accept that classifier's prediction instead; if it still has a low confidence level (<60%) on the same instances, move on and do the same thing with the third classifier.

Finally, if the third classifier also has a low confidence level (<60%), I need to accept the prediction from whichever of the three classifiers has the highest confidence level.
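
To make the intent concrete, here is a rough sketch of the scheme I have in mind (this is only an illustration, not working code I already have: fitted_clfs is a placeholder for my three classifiers, already fitted, with probability estimates enabled, and 0.6 is the confidence threshold described above):

import numpy as np

def cascade_predict(fitted_clfs, X, threshold=0.6):
    # Class probabilities from each classifier, in cascade order
    probs = [clf.predict_proba(X) for clf in fitted_clfs]
    chosen = np.zeros_like(probs[0])
    for i in range(len(X)):
        # Walk the classifiers in order; stop at the first confident one
        for p in probs:
            if p[i].max() >= threshold:
                chosen[i] = p[i]
                break
        else:
            # No classifier reaches the threshold: fall back to the
            # most confident of the three on this instance
            chosen[i] = max(probs, key=lambda pr: pr[i].max())[i]
    # Map winning column indices back to class labels
    # (classes_ has the same order for all classifiers fitted on the same y)
    return fitted_clfs[0].classes_[np.argmax(chosen, axis=1)]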

Since I am new to machine learning, some of my statements may be confusing, for which I apologize; just correct me where I am wrong.

X_test and y_test are shown below. I need to predict on X_test_prepared and evaluate the predictions against y_test_sl using f1_score. The predicted y must have passed through all three classifiers and carry the best confidence level for every instance.

>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
         0.32099948,  0.32099952],
       [ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
        -0.54261829, -0.54261947],
       [ 1.67447042,  0.24530384, -1.0113221 , ..., -0.54844942,
        -0.26066608, -0.26066032],
       ...,
       [ 0.28104683,  1.52670909,  0.62653301, ..., -1.15596295,
         2.05859487,  2.05859247],
       [ 1.50595496,  0.84507934, -0.44109634, ..., -0.71277072,
         0.14474518,  0.14474398],
       [-1.63423112, -0.12690448,  0.48577783, ..., -0.36025459,
         0.29137477,  0.29137047]])

>> y_test_sl
1321    0
1433    0
1859    0
1496    0
492     0
736     0
996     0
1001    0
634     0
1486    0
910     0
1579    0
373     0
1750    0
1563    0
1584    0
51      1
349     0
1162    1
594     0
1121    0
1637    0
1116    0
106     1
1533    0
993     0
960     0
277     0
142     1
1010    0
       ..
1104    1
1404    0
1646    0
1009    0
61      1
444     0
10      1
704     0
744     0
418     0
998     0
740     0
465     0
97      1
1550    1
1738    0
978     0
690     0
1071    0
1228    1
1539    0
145     1
1015    0
1371    0
1758    0
315     0
71      1
1090    0
1766    0
33      1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8

Answer

The goal here turned out to be to create an ensemble of classifiers and take the most "confident" (highest-probability class) prediction across all classifiers. The code is below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_features=4) # Put your training data here instead

# parameters for random forest
rfclf_params = {
    'bootstrap': True, 
    'class_weight':None, 
    'criterion':'entropy',
    'max_depth':None, 
    'max_features':'auto', 
    # ... fill in the rest you want here
}

# Fill in svm params here
svm_params = {
    'probability':True
}

# KNeighbors params go here
kneighbors_params = {

}

params = [rfclf_params, svm_params, kneighbors_params]
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]

def ensemble(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    # One column of probabilities per class, for each test instance
    # (len(classes) rather than a hardcoded 2, so multiclass data works too)
    best_preds = np.zeros((len(X_test), len(classes)))

    for i in range(len(classifiers)):
        # Construct the classifier by unpacking params 
        # store classifier instance
        clf = classifiers[i](**params[i])
        # Fit the classifier as usual and call predict_proba
        clf.fit(X_train, y_train)
        y_preds = clf.predict_proba(X_test)
        # Take maximum probability for each class on each classifier 
        # This is done for every instance in X_test
        # see the docs of np.maximum here: 
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds = np.maximum(best_preds, y_preds)

    # map the maximum probability for each instance back to its corresponding class
    preds = np.array([classes[np.argmax(pred)] for pred in best_preds])
    return preds

# Test your predictions  
from sklearn.metrics import accuracy_score, f1_score
y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))
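
Applied to the question's own variables (assuming X_train and y_train_sl hold the real prepared training data rather than the make_classification placeholder above), the call would look like:

# Hypothetical usage with the question's variable names
y_test_preds = ensemble(classifiers, params, X_train, y_train_sl, X_test_prepared)
print(f1_score(y_test_sl, y_test_preds))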

If you want the algorithm to return the highest probabilities instead of the predicted class, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] rather than preds.
