K-Means Clustering - output clusters contain the same number of elements but in a different order [Python]

Problem description

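I followed this tutorial to perform K-Means clustering on a list of single words. This is a cricket-based project, so I chose K=3 so that I can later label the three clusters as [batting, bowling, fielding]. However, after running the code, the 3 resulting clusters all contain the same elements, just in a different order. I also tried making the initial list distinct, but that did not solve the problem either. The code is attached below.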

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

len(finaldatatext)
#2173
vectorizer = TfidfVectorizer(stop_words='english')
#finaldatatext here is the list containing distinct elements
X = vectorizer.fit_transform(finaldatatext)

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()

clusterlists = []
for i in range(true_k):
    dummy_list = []
    for ind in order_centroids[i]:
        # print('%s' % terms[ind])
        dummy_list.append('%s' % terms[ind])
    clusterlists.append(dummy_list)

A sample of the initial list:

['anymore', 'silly', 'fielders', 'fans', 'rcb', 'precedent', 'reputation', 'pool', 'International', 'famous', 'Astle', 'max', 'stadium', 'bennet', 'working', 'lassi', 'ameetasinh', 'meantime', 'com', 'on', 'little', 'saini', 'Kanos', 'telling', 'six', 'PrithviShaw', 'started', 'letting', 'wYB2P72Il2', 'chess', 'brainwashed', 'Stat', 'mediocre', 'Afridi', 'hopes', 'strength', 'jamieson', 'managed', '46th', 'finale', 'PaRtNeRShIP', 'Another', 'kind', 'exactly', 'Happybirthday', 'out', 'RidaNajamKhan', 'scoreline', 'Career', 'boiiiiiiiiiiiii', 'based', 'starting', 'Test', 'omnipresent', 'Hahaha', 'version', 'victory', 'desert', 'cowards', 'OUTDATED', 'nz', 'inspecting', 'honestly', 'wait', 'Unless', 'steadying', 'think', 'anyone', 'YER', 'rant', 'one', 'odis', 'BANTER', 'paav', 'Ug6cTFgG8U', 'aggressive', 'brought', 'workload', 'Wise', 'ca', 'Brilliant', 'twist', 'open', 'THROWS', 'bringing', 'till', 'starts', 'gives', 'wYB', 'fifty', 'SENA', 'baboon', 'punishment', 'summarized', 'feeling', 'pandya', 'Bangladesh', 'hurting', 'accent', 'Kid', 'well']

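The expected result is three distinct clusters with unique values, which I could then classify as batting, bowling, and fielding based on their elements. Currently I get 3 identical clusters whose elements are in different orders.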

print(clusterlists[0])
#sample reduced result
['absence', 'zize6kysq2', 'flexibility', 'finally', 'finals', 'fined', 'finisher', 'firepower', 'fit', 'fitness', 'flaw', 'flaws', 'fleming', 'fluffed', 'frame', 'fluke', 'fn0uegxgss', 'focussed', 'foot', 'forget', 'forgot', 'form', 'format', 'forward', 'fought', 'fow', 'finale', 'final', 'filter', 'figures', 'fashioned', 'fast', 'fastest', 'fat', 'fatigue', 'fault', 'fav', 'featured', 'feel', 'feeling', 'feels', 'fees', 'feet', 'felt', 'ferguson', 'fewest', 'ffc4pfbvfr', 'ffs', 'field', 'fielder', 'fielders', 'fielding', 'fight', 'fow_hundreds', 'frankly', 'faridabad', 'given', 'giving', 'glad', 'glenn', 'gloves', 'god', 'gods', 'goes', 'going', 'gois', 'gon', 'gone', 'good', 'got', 'grand', 'grandhomme', 'grandmom', 'grandpa', 'grass', 'great', 'greatest', 'greatness', 'greig', 'grind', 'gives', 'gingers', 'free', 'gill', 'frontline','fulfilling', 'future', 'gaandu', 'gabbar', 'gajal_dalmia', 'gambhir', 'game', 'gangsta', 'geez', 'gem', 'genius', 'genuinely', 'gets', 'getter', 'getting', 'giant', 'giddy', 'fascinating', 'fared', 'groupby', 'drives', 'dropped', 'drowning', 'dube', 'dude', 'dumb', 'dumbass', 'duo', 'e3cli7hakf', 'e9fhdkxvvl', 'earlier', 'early', 'earned', 'easiest', 'easily', 'easy', 'economically', 'economy', 'edengarden', 'edge']
len(clusterlists[0])
#1728
len(clusterlists[1])
#1728
len(clusterlists[2])
#1728

All three clusters currently contain the same values. Please suggest a solution. Thanks in advance.

Link of initial finaldatatext list converted to csv.

Recommended answer

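I tested some code for text clustering a while back. Computing distances between texts is somewhat unconventional, but if you really want to do it, here is one way.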

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

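print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    # end="" keeps the cluster's terms on one line (Python 3 print)
    print("Cluster %d:" % i, end="")
    for ind in order_centroids[i, :10]:
        print(" %s" % terms[ind], end="")
    print()

print("\n")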
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)

Just modify it to suit your specific needs.
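
One caveat when adapting this to a list of single words, offered as an observation rather than as part of the original answer: when every document is a single distinct word, each TF-IDF row has a single non-zero entry, so every pair of distinct words is equally far apart and K-Means has little structure to work with. The sketch below illustrates this with a hypothetical four-word list and `euclidean_distances` from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical single-word "documents", none of them English stop words.
words = ["fielders", "stadium", "victory", "wicket"]

X = TfidfVectorizer(stop_words="english").fit_transform(words)

# Each row is L2-normalised and has exactly one non-zero column,
# so distinct words are orthogonal unit vectors.
nonzeros_per_row = X.getnnz(axis=1)
dist = euclidean_distances(X)

print(nonzeros_per_row)
print(dist.round(3))
```

The off-diagonal distances all equal the square root of 2, which suggests that clustering single words on TF-IDF alone cannot separate batting, bowling, and fielding terms. Clustering whole sentences or tweets, as in the answer's `documents` list, gives the vectors overlapping terms to compare.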
