Clustering results change on every run in Python scikit-learn

Question

I have a bunch of sentences that I want to cluster using scikit-learn spectral clustering. The code runs and produces results without any problem, but every time I run it I get different results. I know this is a problem with the initialization, but I don't know how to fix it. This is the part of my code that runs on the sentences:

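from sklearn import cluster
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import kneighbors_graph

# tokenize, data, and number_of_k are defined elsewhere
# (charset_error was renamed to decode_error in later scikit-learn releases)
vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True, tokenizer=tokenize, stop_words='english', decode_error="ignore", ngram_range=(1, 5), min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k, eigen_solver='arpack', affinity="nearest_neighbors", assign_labels="discretize")
spectral.fit(X)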

data is a list of sentences. Every time the code runs, my clustering results differ. How can I get consistent results using spectral clustering? I also have the same problem with k-means. This is my k-means code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', decode_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1, verbose=0)
km.fit(X_data)

Thanks for your help.

Recommended answer

When using k-means, set the random_state parameter of KMeans (see the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). It accepts either an int or a RandomState instance.

km = KMeans(n_clusters=number_of_k, init='k-means++', 
            max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)

This matters because k-means is not a deterministic algorithm. It usually starts with a randomized initialization procedure, so different runs start from different initial centroids and can converge to different clusterings. Seeding the pseudo-random number generator ensures that this randomness is the same on every run that uses the same seed.
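To illustrate the seeding idea in isolation, here is a minimal sketch using only the Python standard library. It is not scikit-learn's k-means++ procedure; the helper pick_initial_centroids and the sample points are hypothetical and exist only for this example. It shows that a generator built from the same seed picks the same starting centroids every time.

```python
import random


def pick_initial_centroids(points, k, seed=None):
    """Pick k distinct points as starting centroids (plain random init)."""
    # With seed=None the generator is seeded from OS entropy or the clock,
    # so each call may pick different centroids. A fixed int seed makes
    # the choice repeatable.
    rng = random.Random(seed)
    return rng.sample(points, k)


# Hypothetical 2-D points standing in for real feature vectors.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]

first = pick_initial_centroids(points, k=3, seed=3425)
second = pick_initial_centroids(points, k=3, seed=3425)

print(first == second)  # True: same seed, same starting centroids
```

Since everything after initialization in k-means is deterministic given the data, identical starting centroids lead to identical final clusters.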

I'm not sure about the spectral clustering example, though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." OP's code (eigen_solver='arpack', assign_labels="discretize") doesn't seem to fall under either of those cases, though setting the parameter might be worth a shot.
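One way to try this is sketched below, under some assumptions: the toy corpus is hypothetical and stands in for the OP's data, n_clusters=3 and n_neighbors=5 are arbitrary choices that fit the small sample, and the default tokenizer is used instead of the OP's tokenize function. The sketch fits SpectralClustering twice with the same random_state and compares the labels. Whether this removes all run-to-run variation may depend on the scikit-learn version, so treat it as an experiment rather than a guarantee.

```python
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the OP's `data`.
sentences = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "my cat sleeps all day",
    "the kitten and the cat play",
    "stocks fell sharply on monday",
    "the stock market rallied today",
    "investors sold stocks in the market",
    "market analysts watch stock prices",
    "the team won the football match",
    "fans cheered at the football game",
    "the football team lost the game",
    "the coach praised the team after the match",
]

X = TfidfVectorizer().fit_transform(sentences)


def cluster_once(seed):
    """Fit spectral clustering with the OP's settings plus a fixed seed."""
    model = SpectralClustering(
        n_clusters=3,
        eigen_solver="arpack",
        affinity="nearest_neighbors",
        n_neighbors=5,  # must not exceed the number of sentences
        assign_labels="discretize",
        random_state=seed,
    )
    return model.fit_predict(X)


labels_a = cluster_once(0)
labels_b = cluster_once(0)

# Reports whether the two seeded runs produced identical labels.
print((labels_a == labels_b).all())
```

If the two label arrays match on your installation, passing the same random_state in the original code is likely enough to make the results repeatable.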
