Accuracy score for a KNN model (IRIS data)


Question


What might be some key factors for increasing or stabilizing the accuracy score (i.e., keeping it from varying significantly between runs) of this basic KNN model on the IRIS data?

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris() 
X, y = iris.data[:, :], iris.target

Xtrain, Xtest, y_train, y_test = train_test_split(X, y)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)

knn = neighbors.KNeighborsClassifier(n_neighbors=4)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))






Sample Accuracy Scores

0.9736842105263158
0.9473684210526315
1.0
0.9210526315789473






Classification Report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.79      1.00      0.88        11
           2       1.00      0.80      0.89        15

    accuracy                           0.92        38
   macro avg       0.93      0.93      0.92        38
weighted avg       0.94      0.92      0.92        38






Sample Confusion Matrix

[[12  0  0]
 [ 0 11  0]
 [ 0  3 12]]


Answer


I would recommend tuning the k value for k-NN. As iris is a small and nicely balanced dataset, I would do the following:


For every value of `k` in range [2 to 10] (say)
  Perform an n-times k-fold cross-validation (say n=20 and k=4)
    Store the accuracy values (or any other metric)
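The loop above can be sketched with scikit-learn's `RepeatedStratifiedKFold` and `cross_val_score`. The 2..10 range and the 20x4 scheme come from the pseudocode; wrapping the scaler and classifier in a `Pipeline` is an addition, so that each fold's scaling statistics are fit on that fold's training data only.

```python
from sklearn import datasets
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 20 repeats of 4-fold CV = 80 accuracy values per candidate k
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=20, random_state=0)

results = {}
for k in range(2, 11):
    # Scaling lives inside the pipeline, so it is re-fit per CV fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=cv)
    results[k] = scores
    print(f"k={k}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

The per-k score arrays collected in `results` are exactly what you would feed to a boxplot for the comparison described next.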


Plot the mean and variance of the scores against k and select the best value of k. The main goal of cross-validation is to estimate the test error; based on that estimate, you select the final model. There will be some variance, but it should be less than 0.03 or so; that depends on the dataset and the number of folds you take. One good procedure is, for each value of k, to make a boxplot of all 20x4 accuracy values. Select the value of k for which the lower quartile intersects the upper quartile, or in simple words, where there is not too much variation in the accuracy (or other metric).


Once you select the value of k based on this, the goal is to use that value to build the final model on the entire training dataset. This model can then be used to predict new data.


On the other hand, for larger datasets, make a separate test partition (as you did here), then tune the value of k on the training set only (using cross-validation; forget about the test set). After selecting an appropriate k, train the algorithm using only the training set. Next, use the test set to report the final score. Never take any decision based on the test set.
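This protocol maps onto scikit-learn's `GridSearchCV`, which here runs the cross-validated k-sweep on the training split only and reports on the test split exactly once (the 4-fold CV and 2..10 range are carried over from above; the pipeline step name `kneighborsclassifier__n_neighbors` is how `make_pipeline` exposes the parameter):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(2, 11))}

# Tuning sees only the training set; the test set plays no role here
grid = GridSearchCV(pipe, param_grid, cv=4)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))  # reported once, at the end
```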


Yet another method is a train/validation/test partition. Train models on the train set using different values of k, then predict on the validation partition and record the scores. Select the best score based on this validation partition. Next, use the train (or train+validation) set to train the final model with the value of k selected on the validation set. Finally, take out the test set and report the final score. Again, never use the test set anywhere else.
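The train/validation/test protocol can be sketched as follows (the 60/20/20 split sizes are an assumption for illustration, not from the answer):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# 20% held out as the final test set, then 25% of the rest as validation
X_tmp, X_test, y_tmp, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

# Select k using the validation set only
best_k, best_score = None, -1.0
for k in range(2, 11):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = model.fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Refit on train+validation with the chosen k; touch the test set exactly once
final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final.fit(X_tmp, y_tmp)
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))
```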


These are general methods applicable to any machine learning or statistical learning method.


An important thing to note when you perform the partition (train/test, or the folds for cross-validation): use stratified sampling so that the class ratios stay the same in each partition.
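In scikit-learn, stratification is a single argument to `train_test_split` (for cross-validation, `StratifiedKFold` does the same job for the folds):

```python
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# stratify=... keeps the 50/50/50 class balance in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

print(Counter(y_train))  # near-equal counts per class
print(Counter(y_test))
```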


Read more about cross-validation. In scikit-learn it is easy to do. If using R, you can use the caret package.


The main thing to remember is that the goal is to train a function that generalises well to new data, not just one that performs well on the existing data.
