sklearn: calculating the accuracy score of k-means on the test data set

Question

I am running k-means clustering with 2 clusters on a set of 30 samples (I already know there are two classes). I split my data into a training set and a test set and try to calculate the accuracy score on the test set. I have two questions. First, I don't know whether computing an accuracy score on a test set is even valid for k-means clustering. Second, if it is, I don't know whether my implementation is right or wrong. Here is what I've tried:

import pandas as pd
from sklearn import cluster, metrics
from sklearn.model_selection import train_test_split

df_hist = pd.read_csv('video_data.csv')

# Separate the known class labels from the features
y = df_hist['label'].values
del df_hist['label']
df_hist.to_csv('video_data1.csv')
X = df_hist.values.astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=70)
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X_train)
print(k_means.labels_[:])
print(y_train[:])

score = metrics.accuracy_score(y_test, k_means.predict(X_test))
print('Accuracy:{0:f}'.format(score))

k_means.predict(X_test)
print(k_means.labels_[:])
print(y_test[:])

But when I print the k-means labels for the test set (k_means.predict(X_test) followed by print(k_means.labels_[:])) and the y_test labels (print(y_test[:])) in the last three lines, I get the same labels as when I fitted on X_train, rather than the labels produced for X_test. Any idea what I might be doing wrong here? Is what I'm doing a valid way to evaluate the performance of k-means at all? Thank you!

Recommended answer

In terms of evaluating accuracy: remember that k-means is not a classification tool, so analyzing accuracy is not a very good idea. You can do it, but it is not what k-means is for. K-means is supposed to find a grouping of the data that keeps points within a cluster close together (equivalently, that maximizes the separation between clusters), and it does not use your labels during training. The cluster ids it assigns are therefore arbitrary and need not line up with your class ids. Consequently, methods like k-means are usually evaluated with the Rand index and other clustering metrics. If your goal is to maximize accuracy, you should fit an actual classifier, such as kNN, logistic regression, or an SVM.
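To make the point about accuracy concrete, here is a minimal sketch on synthetic data. The blobs from make_blobs and all variable names are assumptions for illustration, not the asker's video data. Because cluster ids are arbitrary, the accuracy score changes when the two ids are swapped, while the adjusted Rand index stays the same:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 30 samples in 2 well-separated groups.
X, y = make_blobs(n_samples=30, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=70, stratify=y)

# Fit on the training features only; the labels are never shown to k-means.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
test_clusters = k_means.predict(X_test)

# Accuracy depends on which cluster happens to be called 0 and which 1.
acc = accuracy_score(y_test, test_clusters)
acc_flipped = accuracy_score(y_test, 1 - test_clusters)

# The adjusted Rand index only compares the groupings, not the id names.
ari = adjusted_rand_score(y_test, test_clusters)
ari_flipped = adjusted_rand_score(y_test, 1 - test_clusters)

print('accuracy:', acc, 'accuracy with ids swapped:', acc_flipped)
print('ARI:', ari, 'ARI with ids swapped:', ari_flipped)
```

With two clusters the two accuracy values always sum to 1, so a low accuracy may simply mean the ids are swapped rather than that the clustering is poor.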

In terms of the code itself, k_means.predict(X_test) returns the labels for X_test; it does not update the internal labels_ field, which still holds the labels of the training data from fit. You should do

print(k_means.predict(X_test))
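The difference between the return value of predict and the labels_ attribute can be checked with a small sketch. The random data and the names below are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 24 training points and 6 test points in two groups.
rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(-5, 1, size=(12, 2)),
                     rng.normal(5, 1, size=(12, 2))])
X_test = np.vstack([rng.normal(-5, 1, size=(3, 2)),
                    rng.normal(5, 1, size=(3, 2))])

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
labels_before = k_means.labels_.copy()

# predict returns new labels for X_test and leaves labels_ untouched.
test_labels = k_means.predict(X_test)

print(np.array_equal(labels_before, k_means.labels_))
print(len(k_means.labels_), len(test_labels))
```

Here labels_ keeps one entry per training sample, while the array returned by predict has one entry per test sample.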

Furthermore, in Python you do not have to (and should not) use [:] to print an array; just do

print(k_means.labels_)
print(y_test)
