sklearn: calculating accuracy score of k-means on the test data set


Question

I am doing k-means clustering with 2 clusters on a set of 30 samples (I already know there are two classes). I split my data into a training set and a test set and try to calculate the accuracy score on the test set. But I have two questions. First, I don't know whether computing an accuracy score on a test set is even valid for k-means clustering. Second, if it is valid, I don't know whether my implementation is correct. Here is what I've tried:

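import numpy as np
import pandas as pd
from sklearn import cluster, metrics, model_selection

df_hist = pd.read_csv('video_data.csv')

# Separate the known class labels from the feature columns
y = df_hist['label'].values
del df_hist['label']
df_hist.to_csv('video_data1.csv')
X = df_hist.values.astype(np.float64)  # np.float is no longer available in NumPy

# model_selection replaces the removed sklearn.cross_validation module
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.20, random_state=70)
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X_train)
print(k_means.labels_[:])
print(y_train[:])

score = metrics.accuracy_score(y_test, k_means.predict(X_test))
print('Accuracy:{0:f}'.format(score))

k_means.predict(X_test)
print(k_means.labels_[:])
print(y_test[:])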

But when I print the k-means labels for the test set (k_means.predict(X_test) followed by print(k_means.labels_[:])) and the y_test labels (print(y_test[:])) in the last three lines, I get the same labels as when I fitted on X_train, rather than the labels produced for X_test. Any idea what I might be doing wrong here? Is what I'm doing a valid way to evaluate the performance of k-means at all? Thank you!

Recommended answer

On evaluating accuracy: remember that k-means is not a classification tool, so analyzing accuracy is not a very good idea. You can do it, but that is not what k-means is for. It looks for a grouping of the data that minimizes within-cluster variance (which amounts to pushing the clusters apart), and it does not use your labels during training. In particular, the cluster indices it assigns are arbitrary and need not line up with your class labels. Consequently, methods like k-means are usually evaluated with the Rand index and other clustering metrics. If the goal is to maximize accuracy, fit an actual classifier such as kNN, logistic regression, or an SVM.
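As a rough illustration of the point about clustering metrics, the sketch below scores k-means on a held-out set with the adjusted Rand index, which does not change when the cluster ids are swapped, and contrasts it with raw accuracy, which does. The data is a synthetic stand-in generated with make_blobs (an assumption, since video_data.csv is not available), and the kNN model with n_neighbors=3 is only one possible choice of classifier.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 30-sample, 2-class data set (assumption).
X, y = make_blobs(n_samples=30, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=70
)

# Fit on the training set only; the class labels are never used here.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
test_clusters = k_means.predict(X_test)

# Adjusted Rand index: unaffected by which cluster is called 0 and which 1.
ari = adjusted_rand_score(y_test, test_clusters)
ari_flipped = adjusted_rand_score(y_test, 1 - test_clusters)

# Raw accuracy depends on the arbitrary cluster ids, so with two clusters
# both possible mappings have to be checked.
raw_acc = accuracy_score(y_test, test_clusters)
flipped_acc = accuracy_score(y_test, 1 - test_clusters)
best_acc = max(raw_acc, flipped_acc)

# A supervised classifier, which does use the labels, for comparison.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
knn_acc = accuracy_score(y_test, knn.predict(X_test))

print('ARI: {0:f}'.format(ari))
print('Accuracy (raw / best mapping): {0:f} / {1:f}'.format(raw_acc, best_acc))
print('kNN accuracy: {0:f}'.format(knn_acc))
```

With only two clusters, checking both mappings by hand is enough; with more clusters, the mapping would have to be searched over all permutations, which is one more reason to prefer a permutation-invariant clustering metric.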

As for the code itself, k_means.predict(X_test) returns the labels for X_test; it does not update the internal labels_ field, which keeps the labels assigned to the training data during fit. You should do

print(k_means.predict(X_test))
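To make the difference between labels_ and the return value of predict concrete, here is a minimal sketch on random data (the 24/6 split and the 3 feature columns are arbitrary assumptions chosen to mimic an 80/20 split of 30 samples).

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in data: 24 training rows and 6 test rows (assumption).
rng = np.random.RandomState(0)
X_train = rng.rand(24, 3)
X_test = rng.rand(6, 3)

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# labels_ holds one cluster id per training sample, set during fit.
labels_before = k_means.labels_.copy()

# predict returns a new array with one cluster id per test sample.
test_labels = k_means.predict(X_test)

print(k_means.labels_.shape)   # one entry per training row
print(test_labels.shape)       # one entry per test row
print(np.array_equal(labels_before, k_means.labels_))  # labels_ is unchanged
```

The length of labels_ matching the training set rather than the test set is a quick way to spot that the wrong array is being printed.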

Furthermore, in Python you do not have to (and should not) use [:] to print an array. Just do

print(k_means.labels_)
print(y_test)
