ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Problem description

I am trying to calculate the silhouette score while searching for the optimal number of clusters to create, but I get the following error:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I don't understand what causes this. Here is the code I am using to cluster the data and calculate the silhouette score.

I read the CSV that contains the text to be clustered and run K-Means for a range of cluster counts. What could be the reason I am getting this error?

# Create clusters using K-Means
# Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)

Recommended answer

The error occurs because your loop iterates over different numbers of clusters. During the first iteration, n_clusters is 1, so every sample receives the same label and all(km.labels_ == 0) is True. silhouette_score compares each sample's own cluster with the nearest other cluster, so it needs at least two distinct labels.

In other words, you have only one cluster, with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).

Example:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
# y is accepted for API consistency but ignored by KMeans
km.fit(X, y)

# check how many unique labels you have
np.unique(km.labels_)
#array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
0.38788915189699597

The function works fine.

Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
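One way to avoid the error in the question's loop is to compute the silhouette score only when the clustering has a valid number of distinct labels (or simply to start the loop at 2). The sketch below is a minimal illustration, not the asker's actual pipeline: it uses the iris data in place of the Doc2Vec vectors, and the n_init and random_state values are assumptions chosen only to make the run repeatable. It also picks the candidate k with the highest silhouette score, which is a common heuristic rather than something stated in the question.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in data; the question uses Doc2Vec vectors instead.
X = datasets.load_iris().data

sse = {}
silhouette = {}

for k in range(1, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_
    # silhouette_score requires between 2 and n_samples - 1 distinct labels,
    # so skip it for k = 1 (and for any other degenerate clustering).
    n_labels = len(np.unique(km.labels_))
    if 2 <= n_labels <= len(X) - 1:
        silhouette[k] = silhouette_score(X, km.labels_, metric='euclidean')

# Higher silhouette is better, so take the k with the maximum score.
best_k = max(silhouette, key=silhouette.get)
print(sorted(silhouette))
print(best_k)
```

Note that the SSE (inertia) is still recorded for k = 1, which is useful for an elbow plot, while the silhouette dictionary only contains entries for k >= 2.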
