KMeans clustering unbalanced data

Question

I have a set of data with 50 features (c1, c2, c3 ...), with over 80k rows.

Each row contains normalised numerical values (ranging 0-1). The features are actually normalised dummy variables, so some rows have only a few features, 3-4 (i.e. 0 is assigned where there is no value). Most rows have about 10-20 features.

I used KMeans to cluster the data, and it always results in one cluster with a large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together, which is not what I want.

Is there any way to balance the clusters?

Recommended answer

It is not part of the k-means objective to produce balanced clusters. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the sum-of-squares, and putting these objects into one cluster seems to be beneficial.
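To make the duplicates argument concrete, here is a small sketch that compares the sum-of-squares of an unbalanced split with that of a forced balanced split. The toy 1-D dataset and the `sse` helper are made up for illustration and are not part of the original question.

```python
def sse(clusters):
    """Sum of squared distances of each 1-D point to its cluster mean."""
    total = 0.0
    for points in clusters:
        mean = sum(points) / len(points)
        total += sum((p - mean) ** 2 for p in points)
    return total


# Hypothetical toy dataset: eight duplicates at 0.0 and two duplicates at 1.0.
data = [0.0] * 8 + [1.0] * 2

# Unbalanced split that follows the duplicates (cluster sizes 8 and 2).
natural = [data[:8], data[8:]]

# Balanced split forced to cluster sizes 5 and 5.
balanced = [data[:5], data[5:]]

natural_sse = sse(natural)
balanced_sse = sse(balanced)
print("unbalanced SSE:", natural_sse)
print("balanced SSE:  ", balanced_sse)
```

The unbalanced split has zero error because every cluster contains only identical points, while the balanced split has to mix the two groups and pays for it. Since k-means only looks at this error, it has no reason to prefer the balanced solution.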

What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data are simply not well suited to k-means' use of means. Furthermore, you'd probably need to carefully weight the variables, too.

Now a hotfix that will likely improve your results (at least the perceived quality, because I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This will emphasize the nonzero entries of rows that have only a few of them. You'll probably like the results more, but they are even harder to interpret.
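A minimal sketch of that hotfix, assuming scikit-learn is the KMeans implementation in use (the question does not say). The synthetic data, the choice of 5 clusters, and all variable names are placeholders for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (an assumption): 200 rows, 50 features
# in [0, 1], with only 3 to 20 nonzero entries per row and 0 everywhere else.
n_rows, n_features = 200, 50
X = np.zeros((n_rows, n_features))
for i in range(n_rows):
    n_active = rng.integers(3, 21)  # upper bound is exclusive
    cols = rng.choice(n_features, size=n_active, replace=False)
    X[i, cols] = rng.uniform(0.1, 1.0, size=n_active)

# Scale every row to Euclidean norm 1, so rows with few nonzero entries
# no longer sit close to the origin (and close to each other).
X_unit = normalize(X, norm="l2")

# The number of clusters is arbitrary here; use whatever you used before.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_unit)

# Cluster sizes, to compare against the sizes obtained without normalization.
print(np.bincount(labels, minlength=5))
```

Note that `normalize` leaves all-zero rows unchanged, so any row with no features at all would still need separate handling.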
