sklearn: calculating accuracy score of k-means on the test data set

Views: 163 | Published: 2020/4/26 10:20:14 | Tags: python, scikit-learn, k-means

Problem description

I am doing k-means clustering on a set of 30 samples with 2 clusters (I already know there are two classes). I divide my data into training and test sets and try to calculate the accuracy score on the test set. But there are two problems. First, I don't know if I can actually do this (accuracy score on a test set) for k-means clustering. Second, if I am allowed to do this, I don't know whether my implementation is right or wrong. Here is what I've tried:

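import numpy as np
import pandas as pd
from sklearn import cluster, metrics, model_selection

df_hist = pd.read_csv('video_data.csv')

# Separate the known class labels from the feature columns
y = df_hist['label'].values
del df_hist['label']
df_hist.to_csv('video_data1.csv')
X = df_hist.values.astype(np.float64)

# sklearn.cross_validation was replaced by sklearn.model_selection in newer releases
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.20, random_state=70)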
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X_train)
print(k_means.labels_[:])
print(y_train[:])

score = metrics.accuracy_score(y_test,k_means.predict(X_test))
print('Accuracy:{0:f}'.format(score))

k_means.predict(X_test)
print(k_means.labels_[:])
print(y_test[:])

But when I print the k-means labels for the test set (k_means.predict(X_test) followed by print(k_means.labels_[:])) and the y_test labels (print(y_test[:])) in the last three lines, I get the same labels as when I was fitting on X_train, rather than the labels that were produced for X_test. Any idea what I might be doing wrong here? Is what I'm doing a valid way to evaluate the performance of k-means at all? Thank you!

Recommended answer

In terms of evaluating accuracy: you should remember that k-means is not a classification tool, so analyzing accuracy is not a very good idea. You can do this, but it is not what k-means is for. It is supposed to find a grouping of the data that keeps points within a cluster close together and the clusters far apart, and it does not use your labels during training. The cluster ids it assigns (0 and 1) are therefore arbitrary and need not line up with your class labels. Consequently, methods like k-means are usually evaluated with the Rand index and other clustering metrics. To maximize accuracy you should fit an actual classifier, such as kNN, logistic regression, or an SVM.
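The two evaluation routes described above can be sketched roughly as follows. This is only an illustration under assumptions: the data comes from make_blobs as a synthetic stand-in for video_data.csv, the split uses stratify=y (not in the original code) so that both classes appear in each subset, and adjusted_rand_score and LogisticRegression are just one possible choice of clustering metric and classifier.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the 30-sample, 2-class data in the question.
X, y = make_blobs(n_samples=30, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=70, stratify=y
)

# Route 1: cluster without labels, then score with a clustering metric.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
test_clusters = k_means.predict(X_test)
flipped = 1 - test_clusters  # same grouping, cluster ids swapped

ari = adjusted_rand_score(y_test, test_clusters)
ari_flipped = adjusted_rand_score(y_test, flipped)

# Accuracy depends on the arbitrary cluster numbering; the adjusted Rand index does not.
acc = accuracy_score(y_test, test_clusters)
acc_flipped = accuracy_score(y_test, flipped)

# Route 2: if accuracy is the goal, fit a supervised classifier on the labels.
clf = LogisticRegression().fit(X_train, y_train)
clf_acc = accuracy_score(y_test, clf.predict(X_test))

print("k-means ARI:", ari, "(after swapping ids:", ari_flipped, ")")
print("k-means accuracy:", acc, "(after swapping ids:", acc_flipped, ")")
print("classifier accuracy:", clf_acc)
```

Swapping the two cluster ids leaves the adjusted Rand index unchanged while the accuracy turns into one minus its previous value, which is why a raw accuracy score on k-means output is hard to interpret.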

In terms of the code itself, k_means.predict(X_test) returns the labeling; it does not update the internal labels_ field, which always holds the labels of the data passed to fit. You should do:

print(k_means.predict(X_test))
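A small check of this behaviour, as a sketch on made-up data (the arrays below are random stand-ins generated with NumPy, not the data from the question):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: two well-separated synthetic groups, 24 training and 6 test points.
rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(0, 1, (12, 2)), rng.normal(8, 1, (12, 2))])
X_test = np.vstack([rng.normal(0, 1, (3, 2)), rng.normal(8, 1, (3, 2))])

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
labels_before = k_means.labels_.copy()

# predict() returns the cluster ids for the new points ...
test_labels = k_means.predict(X_test)

# ... and leaves labels_ (the assignments of the training points) untouched.
unchanged = np.array_equal(labels_before, k_means.labels_)

print(k_means.labels_.shape)  # one label per training sample
print(test_labels.shape)      # one label per test sample
print(unchanged)
```

The return value of predict has to be captured or printed directly; reading labels_ afterwards only shows the training assignments again.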

Furthermore, in Python you do not have to (and should not) use [:] to print an array; just do:

print(k_means.labels_)
print(y_test)
