ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

Problem description

I am trying to calculate the silhouette score while searching for the optimal number of clusters, but I get this error:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I don't understand the reason for it. Here is the code I am using to cluster the data and calculate the silhouette score.

I read the csv that contains the text to be clustered and run K-Means for a range of cluster counts. What could be causing this error?

#Create cluster using K-Means
#Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)

Solution

The error occurs because your loop tries different numbers of clusters, starting at 1. During the first iteration, n_clusters is 1, so every sample is assigned to the same cluster and all(km.labels_ == 0) is True.

In other words, you have only one cluster with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).


silhouette_score requires at least 2 distinct cluster labels (and at most n_samples - 1). With a single label the score is undefined, which is exactly what the error message says.
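If you want to keep k=1 in a sweep, one option is to check the number of distinct labels before calling silhouette_score. The following is only a minimal sketch: safe_silhouette is a hypothetical helper name (not part of scikit-learn), and the data comes from make_blobs as a synthetic stand-in rather than from the question's csv.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels):
    """Return the silhouette score, or None when it is undefined.

    Hypothetical helper: silhouette_score needs between 2 and
    n_samples - 1 distinct labels, so any other count is skipped.
    """
    n_labels = len(np.unique(labels))
    n_samples = len(labels)
    if n_labels < 2 or n_labels > n_samples - 1:
        return None
    return silhouette_score(X, labels, metric="euclidean")


# Synthetic stand-in data: 60 points scattered around 3 centers.
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

one_cluster = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
three_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

score_one = safe_silhouette(X, one_cluster.labels_)       # None, no exception
score_three = safe_silhouette(X, three_clusters.labels_)  # a value in [-1, 1]
print(score_one, score_three)
```

Returning None instead of raising keeps the loop running, but the caller then has to skip None entries when comparing scores.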


Example:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
# KMeans is unsupervised: y is accepted by fit() but ignored
km.fit(X,y)

# check how many unique labels you have
np.unique(km.labels_)
#array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
#0.38788915189699597

The function works fine.


Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
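For the loop in the question, the simplest fix is to start the range at 2 instead of 1. The sketch below assumes synthetic data from make_blobs in place of the Doc2Vec vectors; the fixed random_state and n_init values are illustrative choices, not something taken from the original code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the Doc2Vec vectors in the question.
vectors, _ = make_blobs(n_samples=200, centers=4, random_state=42)

sse = {}
silhouette = {}

# Start at 2: with k=1 every sample gets label 0 and silhouette_score raises.
for k in range(2, 15):
    km = KMeans(n_clusters=k, max_iter=1000, n_init=10, random_state=42).fit(vectors)
    sse[k] = km.inertia_
    silhouette[k] = silhouette_score(vectors, km.labels_, metric="euclidean")

# A higher silhouette score is better, so take the k with the largest value.
best_cluster_size = max(silhouette, key=silhouette.get)
print(best_cluster_size, silhouette[best_cluster_size])
```

Note that the question picks best_cluster_size by the smallest SSE. Inertia generally keeps shrinking as clusters are added, so that rule tends to select the largest k tried; choosing by silhouette score, as here, is one common alternative.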
