Matlab k-means cosine assigns everything to one cluster

Question

I'm using Matlab's regular kmeans algorithm with 'Distance','cosine','EmptyAction','drop' on an L2-normalized feature matrix and I have a problem. The output that Matlab generates is simply assigning EVERY datapoint to cluster 1.00000, even if k=20, and all centroids in C are NaN. Does anyone have any suggestions as to what might be causing this?

The layout of the matrix is ([0,1,...,1,0,1],[...],[0,1,...,1,0,1]). I've done the L2-normalization using Python's numpy.linalg.norm before I passed the file to Matlab. This is the exact way I am running kmeans:

m=importdata('matrix.txt');
data=m'; % transpose so that rows are observations and columns are features, as kmeans expects
[L, C]=kmeans(data, 20, 'Distance', 'cosine', 'EmptyAction', 'drop')

Here is a sample of my normalized dataset:

10.3440804328
12.6885775404
15.5884572681
15.9059737206
17.4355957742
17.0
17.3493515729
17.3205080757
18.6279360102
19.7230829233
21.400934559
22.0
22.5831795813
23.0
24.0416305603
25.2388589282
26.8141753556
22.5388553392
9.2736184955
13.5277492585
15.2970585408

Any help or suggestions would be greatly appreciated. If you need more information let me know!

Recommended answer

The cosine distance is what makes it fail; the same call works with sqEuclidean. I think the cosine distance needs more information, or else it doesn't make sense on your data set.

I agree with you that the documentation is a little vague here, but the definition of cosine distance in Matlab's pdist function is: "One minus the cosine of the included angle between points (treated as vectors)."
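
To make that definition concrete, here is a minimal sketch in plain Python (not Matlab's implementation). The helper name `cosine_distance` and the two-feature points are assumptions for illustration; the single-feature values are copied from the sample in the question. It suggests why a single column gives cosine distance nothing to separate: every positive scalar points in the same direction.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Single-feature points, values taken from the sample in the question.
one_d = [[10.3440804328], [17.0], [26.8141753556]]
one_d_distances = [cosine_distance(p, q) for p in one_d for q in one_d]

# Hypothetical two-feature points, where the angle actually varies.
two_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_orthogonal = cosine_distance(two_d[0], two_d[1])  # vectors at 90 degrees
d_diagonal = cosine_distance(two_d[0], two_d[2])    # vectors at 45 degrees
```

All pairwise distances in `one_d_distances` come out as zero up to floating-point rounding, while the two-feature points get distinct, nonzero distances.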

At first I took that to mean the angle must be included in the data (I assumed in the next column), but that seems to defeat the purpose of cosine similarity. Edit: it is more likely that "included" means the included angle between 2 vectors. In that case I think cosine expects 2 or more columns to work on. With a single column, every positive value points in the same direction, so the angle between any two points is zero.
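
As a sanity check on the input, here is a minimal sketch of row-wise L2 normalization in plain Python. The helper name `l2_normalize` and the 0/1 rows are hypothetical, not the asker's data. Note that numpy.linalg.norm returns the norm itself, so each row still has to be divided by it. After normalization no component can exceed 1 in magnitude, whereas the sample in the question contains values well above 1 in a single column.

```python
import math

def l2_normalize(row):
    """Scale a row so its Euclidean (L2) norm is 1; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0.0:
        # The cosine of the angle to a zero vector is undefined (0/0).
        return list(row)
    return [x / norm for x in row]

# Hypothetical 0/1 feature rows: one observation per row, one feature per column.
rows = [[0, 1, 1, 0, 1], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
normalized = [l2_normalize(r) for r in rows]

# Recompute the norms to verify the result: 1 for nonzero rows, 0 for the zero row.
norms = [math.sqrt(sum(x * x for x in r)) for r in normalized]
```

Checking the recomputed norms (and looking for all-zero rows) before clustering is a cheap way to confirm that the matrix passed to kmeans really holds unit-length vectors with more than one column.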

Also, if you're already into Python, there are some good machine learning tools there as well. Here is one I have used. There is also MILK, but I have never used it myself.
