Scikit K-means clustering performance measure


Problem Description

I'm trying to do clustering with the K-means method, but I would like to measure the performance of my clustering. I'm not an expert, but I am eager to learn more about clustering.

Here is my code:

import pandas as pd
from sklearn import datasets

#loading the dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data)

#K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)

#We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Species']

#we merge this dataframe with df
prediction = pd.concat([df,pred], axis = 1)

#We store the clusters
clus0 = prediction.loc[prediction.Species == 0]
clus1 = prediction.loc[prediction.Species == 1]
clus2 = prediction.loc[prediction.Species == 2]
k_list = [clus0.values, clus1.values,clus2.values]

Now that I have my KMeans model and my three clusters stored, I'm trying to use the Dunn index to measure the performance of my clustering (the higher the index, the better). For that purpose I import the jqm_cvi package (available here):

from jqmcvi import base
base.dunn(k_list)

My question is: do any internal clustering evaluation metrics already exist in scikit-learn (apart from silhouette_score)? Or in another well-known library?

Thank you for your time

Solution

Normally, clustering is considered an unsupervised method, so it is difficult to establish a good performance metric (as also suggested in the previous comments).

Nevertheless, a lot of useful information can be extracted from these algorithms (e.g. k-means). The problem is how to assign a semantics to each cluster, and thus measure the "performance" of your algorithm. In many cases, a good way to proceed is through a visualization of your clusters. Obviously, if your data have high-dimensional features, as happens in many cases, the visualization is not that easy. Let me suggest two ways to go, using k-means and another clustering algorithm.

  • K-means: in this case, you can reduce the dimensionality of your data by using, for example, PCA. With such an algorithm, you can plot the data in a 2D plot and then visualize your clusters. However, what you see in this plot is a projection of your data into a 2D space, so it may not be very accurate, but it can still give you an idea of how your clusters are distributed (see the PCA sketch after this list).

  • Self-organizing map (SOM): this is a clustering algorithm based on neural networks which creates a discretized representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction. You can find a very nice Python package called somoclu which has this algorithm implemented and provides an easy way to visualize the result. This algorithm is also very good for clustering because it does not require an a priori selection of the number of clusters (in k-means you need to choose k; here you don't). A rough somoclu sketch follows after this list.
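
To make the PCA route concrete, here is a minimal sketch (not part of the original answer) that reuses df and y_pred from the question's code, projects the four iris features onto the first two principal components, and colours the points by their K-means label:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

#Project the 4-dimensional iris features onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(df)

#Colour each projected point by its K-means cluster label (y_pred from above)
plt.scatter(coords[:, 0], coords[:, 1], c=y_pred)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('K-means clusters in the PCA projection')
plt.show()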
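
And a rough sketch of the SOM route with somoclu, assuming its documented Somoclu / train / view_umatrix / cluster interface; the 20x20 grid size is an arbitrary choice for illustration, not something prescribed by the answer:

import numpy as np
import somoclu

#somoclu expects a float32 numpy array as input
data = df.values.astype(np.float32)

#Train a 20x20 self-organizing map (grid size chosen arbitrarily here)
som = somoclu.Somoclu(n_columns=20, n_rows=20)
som.train(data)

#Visualize the U-matrix, marking the best matching unit of each sample
som.view_umatrix(bestmatches=True)

#Cluster the codebook vectors (defaults to scikit-learn KMeans)
som.cluster()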
