How do I automate the number of clusters?


Problem description

I've been playing with the script below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import textract
import os

folder_to_scan = '/media/sf_Documents/clustering'
dict_of_docs = {}

# Gets all the files to scan with textract
for root, sub, files in os.walk(folder_to_scan):
    for file in files:
        full_path = os.path.join(root, file)
        print(f'Processing {file}')
        try:
            text = textract.process(full_path)
            dict_of_docs[file] = text
        except Exception as e:
            print(e)


vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

It scans a folder of images that are scanned documents, extracts the text, then clusters the text. I know for a fact there are 3 different types of documents, so I set true_k to 3. But what if I had a folder of unknown documents, where there could be anything from 1 to 100s of different document types?

Answer

This is slippery territory, because it is very difficult to measure how well a clustering algorithm performs without any ground-truth labels. In order to make an automatic selection, you need a metric that compares how KMeans performs for different values of n_clusters.

A popular choice is the silhouette score. You can find more details about it here. This is how the scikit-learn documentation describes it:

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.
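
To make the quoted formula concrete, here is a minimal sketch (the two toy clusters are made up for illustration) that computes (b - a) / max(a, b) by hand for one sample and checks it against scikit-learn's silhouette_samples:

import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny, well-separated clusters (toy data, for illustration only)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the first sample (x = 0.0, in cluster 0):
a = abs(0.0 - 1.0)                               # mean distance to the other member of its own cluster -> 1.0
b = np.mean([abs(0.0 - 10.0), abs(0.0 - 11.0)])  # mean distance to the nearest other cluster -> 10.5
manual = (b - a) / max(a, b)                     # (10.5 - 1.0) / 10.5 ≈ 0.905

print(manual)
print(silhouette_samples(X, labels)[0])          # matches the manual value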

As a result, you can only compute the silhouette score for n_clusters >= 2 (which, given your problem description, might unfortunately be a limitation for you).
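
For example (a minimal sketch with made-up points), silhouette_score raises an error for a single-cluster labelling, so a folder containing only one document type cannot be detected this way:

import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0], [1.0], [2.0]])
labels = np.zeros(3, dtype=int)  # everything assigned to one cluster

try:
    silhouette_score(X, labels)
except ValueError as err:
    print(err)  # complains that the number of labels must be between 2 and n_samples - 1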

This is how you would use it on a dummy data set (you can then adapt it to your code; it is just meant to be a reproducible example):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data

sil_score_max = -1   # -1 is the lowest possible silhouette score
best_n_clusters = 2  # initialise so the name is defined even if no score beats -1

for n_clusters in range(2, 10):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    print("The average silhouette score for %i clusters is %0.2f" % (n_clusters, sil_score))
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

This returns:

The average silhouette score for 2 clusters is 0.68
The average silhouette score for 3 clusters is 0.55
The average silhouette score for 4 clusters is 0.50
The average silhouette score for 5 clusters is 0.49
The average silhouette score for 6 clusters is 0.36
The average silhouette score for 7 clusters is 0.46
The average silhouette score for 8 clusters is 0.34
The average silhouette score for 9 clusters is 0.31

And thus you will have best_n_clusters = 2 (NB: in reality, Iris has three classes, but two of them overlap considerably in feature space, which is why the silhouette score favours merging them).
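
To plug this back into the original script, the same loop can run directly on the TF-IDF matrix. Here is a minimal sketch, assuming choose_n_clusters is a hypothetical helper name and the upper bound of 10 is an arbitrary guess you would tune to your data (it must stay below the number of documents):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_n_clusters(X, max_clusters=10):
    """Hypothetical helper: pick the n_clusters with the best silhouette score."""
    best_score, best_n = -1, 2
    for n in range(2, max_clusters + 1):
        labels = KMeans(n_clusters=n, init='k-means++', max_iter=100, n_init=1).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_n = score, n
    return best_n

# X here would be the matrix produced by vectorizer.fit_transform(...) in the question
true_k = choose_n_clusters(X)
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1).fit(X)

Note that with n_init=1 each KMeans run depends on a single random initialisation, so the scores (and the selected k) can fluctuate between runs; raising n_init makes the selection more stable.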

