How to find optimal number of clusters with Scikit-Learn and Python


Question

I'm learning clustering with Python's scikit-learn library, but I can't find a way to determine the optimal number of clusters. I have tried building a list of cluster counts, passing each into a for loop, and looking for the elbow, but I want a better solution. This approach only works for range(1,11); after that the curve becomes very smooth and I can't see an elbow. I have tried silhouette_score, but I get very low values, sometimes negative ones.

Also, I'm using text data. I wrote a couple of sentences that can (let's say) be grouped: I have sentences about house/home, about studying, parties, food, and so on.

Is there a chance that I'm getting low silhouette_score values because I'm using text data? Also, do I need to scale the data after cv.fit_transform(doc)?

Is there a better way, perhaps some function that returns the optimal number of clusters as an integer? For example 1, 2, 3, 4, ..., n.

Here is the code I wrote:

import sklearn.metrics as sm

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt

doc = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'let s go to the beach', 'how can we do this',
     'i love this product', 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right', ' lets go to the party', 'we were at the party last night', 
     'this is my favourite restaurant, I love their food, its so good','i love healty food', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs', "i'm on the road again", 
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip, it was amazing', 'Party last night was so boring', 'lets go on road trip', 'this is my home, im living there for 26 years',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band', 'true love', 'party was great','home sweet home',
     'I cant believe that you did that', 'Why are you doing that, I do not get it', 'this is tasty', 'this song is amazing', 'this food is tasty', 'lets go to the cinema', 'lets get together at my house',
     'I need to study for the test', 'I cant go out this weekend', 'I had a great time last night', 'I went out last night and it was amazing', 'you are beautiful', 'we crashed the party',
     'this is the best song i have ever heard', 'i love listening to music', 'music is my life', 'this song is terrible', 'how was your hollyday', 'i do not understand you, I have told you that last night',
      'I know whats best for you', 'I m on collage now', 'this is my favourite subject', 'math is fun', 'i love to study maths', 'programming is my live', 'i need to study, my final exam is tomorrow',
      'i m cooming home', 'i need to clean my house', 'what do you thing about last night', 'lets go out, my house is a mess', 'Im staying at home tonight', 'love is such a beautiful word',
      'i want to buy new house for me and my family', 'im will be home in a couple of hours', 'im working on a science project', 'working is hard and i need to work', 'you need to find a job',
       'this is bad, and we cant do anything about that', 'real estate market is growing', 'im selling my appartment', 'i live at the appartment above', 'i m into real estate', 'prices are going down',
       'i m building house of cards', 'I feel so tired, i was studying all nigh long', 'i was playing piano for more than 10 years and I was pretty good at it','I have never done that in my life',
       'i will buy this product in a couple of days', 'i m buying new phone next month', 'my home is near by', 'i m living in my home', 'i live in my parents house', 'i m living in my appartment',
       'my phone is very slow', 'do you know password for wifi', 'wifi is short for wireless network', 'you are so funny', 'my neighbours are horrible', 'such a nice phone, im glad to have it',
       'last time we went into that club and it was so boring', 'if I were you, i would never said that', 'you done very good work, your boss is very proud of you', 'Overall, I like this place a lot',
       'I was spending money on wrong things', 'whats the price for this item', 'where can I buy it', 'is it for sale', 'This hole in the wall has great Mexican street tacos, and friendly staff',
       'The movie showed a lot of Florida at it s best, made it look very appealing', 'This short film certainly pulls no punches', 'This is the kind of money that is wasted properly',
       'Not only did it only confirm that the film would be unfunny and generic, but it also managed to give away the ENTIRE movie', 'But it s just not funny','you have already done that',
       'I especially liked the non-cliche choices with the parents', 'it was well-paced and suited its relatively short run time']


cv = TfidfVectorizer(analyzer = 'word', max_features = 4000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
x = cv.fit_transform(doc)

my_list = []
for i in range(1, 10):

    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    cluster_labels = kmeans.fit_predict(x)  # labels were never assigned in the original code
    my_list.append(kmeans.inertia_)
    if i >= 2:  # silhouette_score is undefined for a single cluster
        silhouette_avg = silhouette_score(x, cluster_labels)
        print(silhouette_avg)

plt.plot(range(1, 10), my_list)
plt.show()

Answer

Finding the optimal number of clusters is, in general, a hard problem: there is no unique solution, and the problem is not deterministic (especially for text data). Furthermore, the optimal solution of a clustering problem is a local optimum of the particular measure the model you use is optimizing, and there exists a large number of clustering models.
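The asker also wanted a function that returns the optimal k as an integer. As a practical heuristic (not part of the original answer), one can scan a candidate range and return the k that maximizes the average silhouette score. Below is a minimal sketch on synthetic data; the function name and parameters are illustrative, not from the original post:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_min=2, k_max=8, random_state=0):
    """Return the k in [k_min, k_max] with the highest average silhouette score.

    silhouette_score needs at least 2 clusters, so k_min must be >= 2.
    """
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)

# Sanity check on three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
print(best_k_by_silhouette(X))
```

On sparse TF-IDF vectors the scores will typically be much lower than on dense, well-separated data (as the asker observed); passing metric='cosine' to silhouette_score is often more appropriate for text than the default Euclidean distance.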

Therefore, a baseline that automatically learns the "right" number of clusters for text data is the so-called Hierarchical Dirichlet Process (HDP), which generalizes the Latent Dirichlet Allocation (LDA) model.

You can find an implementation in the gensim library.
