Unbalanced factor of KMeans?


Problem description

The answer to this question can be found in: Sum in Spark gone bad

In Compute Cost of Kmeans, we saw how one can compute the cost of his KMeans model. I was wondering if we are able to compute the Unbalanced factor?

If there is no such functionality provide by Spark, is there any easy way to implement this?

I was not able to find a ref for the Unbalanced factor, but it should be similar to Yael's unbalanced_factor (my comments):

// @hist: the number of points assigned to a cluster
// @n:    the number of clusters
double ivec_unbalanced_factor(const int *hist, long n) {
  int vw;
  double tot = 0, uf = 0;

  for (vw = 0 ; vw < n ; vw++) {
    tot += hist[vw];
    uf += hist[vw] * (double) hist[vw];
  }

  uf = uf * n / (tot * tot);

  return uf;
}

I found it here.

So the idea is that tot (for total) will be equal to the number of points assigned to clusters (i.e. equal to the size of our dataset), while uf (for unbalanced factor) accumulates the sum of the squared cluster sizes.

Finally, he computes it with uf = uf * n / (tot * tot);.
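As a sanity check, the same formula can be ported to a few lines of plain Python (the helper name unbalanced_factor is my own). A perfectly balanced assignment gives exactly 1.0, and any skew pushes the value above 1:

```python
def unbalanced_factor(hist):
    """Port of the C function above: hist holds the number of
    points assigned to each cluster."""
    n = len(hist)                          # number of clusters
    tot = sum(hist)                        # total number of points
    uf = sum(h * float(h) for h in hist)   # sum of squared cluster sizes
    return uf * n / (tot * tot)

print(unbalanced_factor([25, 25, 25, 25]))  # balanced -> 1.0
print(unbalanced_factor([70, 10, 10, 10]))  # skewed   -> 2.08
```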

Answer

In python it could be something like:

# I suppose you are passing an RDD of tuples, where the key is the cluster
# and the value is a vector with the features.
def unbalancedFactor(rdd):
    pdd = rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)  # points per cluster
    n = pdd.count()                                   # number of clusters
    total = pdd.map(lambda x: x[1]).sum()             # total number of points
    uf = pdd.map(lambda x: x[1] * float(x[1])).sum()  # sum of squared cluster sizes

    return uf * n / (total * total)
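To try the same pipeline without a Spark cluster, the RDD steps can be mimicked on a plain list of (cluster, features) tuples; this is a minimal sketch assuming collections.Counter stands in for the reduceByKey step, and the name local_unbalanced_factor is my own:

```python
from collections import Counter

def local_unbalanced_factor(assignments):
    """Same computation as unbalancedFactor, but on a plain list
    of (cluster_id, feature_vector) tuples instead of an RDD."""
    hist = Counter(cluster for cluster, _ in assignments)  # points per cluster
    n = len(hist)                                # number of clusters
    total = sum(hist.values())                   # total number of points
    uf = sum(c * float(c) for c in hist.values())  # sum of squared sizes
    return uf * n / (total * total)

# Two clusters: three points in one, one point in the other.
data = [(0, [1.0]), (0, [1.1]), (0, [0.9]), (1, [5.0])]
print(local_unbalanced_factor(data))  # 1.25
```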
