K-Means clustering for word vectors (300 dimensions)
Problem Description
I am writing a program for which I need to apply K-means clustering to a data set of some 200 arrays, each with 300 elements. Could someone provide me with a link to code, with explanations, on: 1. finding k through the elbow method, and 2. applying the K-means method and getting the arrays for the centroids?
I have searched for the above on my own but have not found anything with clear explanations of the code. P.S. I am working on Google Colab, so if there are methods specific to it, do suggest them.
I tried the code below, but I keep getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-70-68e300fd4bf8> in <module>()
24
25 # step 1: find optimal k (number of clusters)
---> 26 find_best_k()
27
3 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: setting an array element with a sequence.
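A common cause of this particular error (a minimal reproduction with made-up values, not necessarily the asker's exact data) is passing NumPy a ragged nested list, i.e. rows of unequal length, which cannot be packed into a rectangular 2-D float array:

```python
import numpy as np

# Hypothetical illustration: the second row is shorter than the first,
# so NumPy cannot build a 2-D float array and raises the same ValueError.
ragged = [[1.0, 2.0], [3.0]]
try:
    np.asarray(ragged, dtype=float)
except ValueError as e:
    print("ValueError:", e)

# A rectangular list, with all rows the same length, converts cleanly:
ok = np.asarray([[1.0, 2.0], [3.0, 4.0]], dtype=float)
print(ok.shape)  # (2, 2)
```

If the word vectors are stored as lists of lists, checking that every inner list has exactly 300 elements before calling `np.asarray` is a reasonable first debugging step.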
Answer
Suppose there are 12 samples, each with two features, as below:
data=np.array([[1,1],[1,2],[2,1.5],[4,5],[5,6],[4,5.5],[5,5],[8,8],[8,8.5],[9,8],[8.5,9],[9,9]])
You can find the optimal number of clusters using the elbow method, and then the cluster centers, as in the following example:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = np.array([[1, 1], [1, 2], [2, 1.5], [4, 5], [5, 6], [4, 5.5], [5, 5],
                 [8, 8], [8, 8.5], [9, 8], [8.5, 9], [9, 9]])

def find_best_k():
    sum_of_squared_distances = []
    K = range(1, 8)  # adjust the upper bound of 8 for your data
    for k in K:
        km = KMeans(n_clusters=k)
        km = km.fit(data)
        sum_of_squared_distances.append(km.inertia_)
    plt.plot(K, sum_of_squared_distances, 'bx-')
    plt.xlabel('k')
    plt.ylabel('sum_of_squared_distances')
    plt.title('Elbow method for optimal k')
    plt.show()
    # The plot looks like an arm; the elbow of the arm is the optimal k.

# step 1: find the optimal k (number of clusters)
find_best_k()

def run_kmeans(k, data):  # k is the optimal number of clusters
    km = KMeans(n_clusters=k)
    km = km.fit(data)
    centroids = km.cluster_centers_  # get the cluster centers
    return centroids

def plotresults():
    centroids = run_kmeans(3, data)
    plt.plot(data[0:3, 0], data[0:3, 1], 'ro',
             data[3:7, 0], data[3:7, 1], 'bo',
             data[7:12, 0], data[7:12, 1], 'go')
    for i in range(3):
        plt.plot(centroids[i, 0], centroids[i, 1], 'k*')
        plt.text(centroids[i, 0], centroids[i, 1], "c" + str(i), fontsize=12)

plotresults()
Elbow plot: (image in original post)
Results: (image in original post)
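The 2-D toy example above carries over directly to the asker's case of roughly 200 word vectors of 300 dimensions each. A minimal sketch, using random data as a stand-in for the real word vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the real word vectors: 200 samples, 300 features.
rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((200, 300))

# KMeans handles high-dimensional input the same way; the input just has
# to be a rectangular 2-D float array of shape (n_samples, n_features).
X = np.asarray(word_vectors, dtype=float)
assert X.ndim == 2 and X.shape[1] == 300

# k=5 here is an arbitrary choice; pick it from your own elbow plot.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_   # shape (5, 300): one 300-d center per cluster
labels = km.labels_               # cluster index for each of the 200 vectors
print(centroids.shape, labels.shape)
```

The same `find_best_k` loop works unchanged on `X`; only the plotting code needs adapting, since 300-dimensional points cannot be scattered directly (a dimensionality reduction such as PCA is a common workaround for visualization).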
Hope this helps.