Unbalanced factor of KMeans?


Question

See also: Summing in Spark gone bad


In Compute Cost of Kmeans, we saw how one can compute the cost of a KMeans model. I was wondering whether we can compute the Unbalanced factor as well?


If there is no such functionality provided by Spark, is there an easy way to implement it?


I was not able to find a reference for the Unbalanced factor, but it should be similar to Yael's unbalanced_factor (my comments):

// @hist: the number of points assigned to each cluster
// @n:    the number of clusters
double ivec_unbalanced_factor(const int *hist, long n) {
  int vw;
  double tot = 0, uf = 0;

  for (vw = 0; vw < n; vw++) {
    tot += hist[vw];
    uf += hist[vw] * (double) hist[vw];
  }

  uf = uf * n / (tot * tot);

  return uf;
}

I found it here.


So the idea is that tot (for total) will be equal to the number of points assigned to clusters (i.e. equal to the size of our dataset), while uf (for unbalanced factor) accumulates the sum of the squares of the per-cluster counts.

At the end, it computes uf = uf * n / (tot * tot);.
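To see how the factor behaves, here is a small worked example (my own sketch, a direct Python port of the C function above, not from the original post). For k clusters of equal size the factor is exactly 1, and it grows as the assignment becomes more skewed:

```python
def unbalanced_factor(hist):
    # hist[i] = number of points assigned to cluster i
    n = len(hist)
    tot = float(sum(hist))                 # total number of points
    uf = sum(h * float(h) for h in hist)   # sum of squared counts
    return uf * n / (tot * tot)

print(unbalanced_factor([25, 25, 25, 25]))  # perfectly balanced: 1.0
print(unbalanced_factor([70, 10, 10, 10]))  # skewed: 2.08
```

In the skewed case, 70^2 + 3 * 10^2 = 5200, and 5200 * 4 / 100^2 = 2.08, so the factor more than doubles relative to the balanced assignment.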

Answer

In Python, it could be something like:

# I suppose you are passing an RDD of (cluster, feature_vector) tuples,
# where the key is the cluster a point was assigned to.
def unbalancedFactor(rdd):
  pdd = rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)  # points per cluster
  n = pdd.count()                                   # number of clusters
  total = pdd.map(lambda x: x[1]).sum()             # total number of points
  uf = pdd.map(lambda x: x[1] * float(x[1])).sum()  # sum of squared counts

  return uf * n / (total * total)
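For a quick sanity check without a Spark cluster, the same computation can be reproduced on a plain list of (cluster, point) tuples (my own sketch; the variable names and the `unbalanced_factor_local` helper are illustrative, not part of any Spark API):

```python
from collections import Counter

def unbalanced_factor_local(assignments):
    # assignments: list of (cluster_id, feature_vector) tuples
    hist = Counter(cluster for cluster, _ in assignments)  # points per cluster
    n = len(hist)                                   # number of clusters
    tot = float(sum(hist.values()))                 # total number of points
    uf = sum(c * float(c) for c in hist.values())   # sum of squared counts
    return uf * n / (tot * tot)

# Two clusters with two points each: perfectly balanced.
balanced = [(0, [1.0]), (0, [1.1]), (1, [5.0]), (1, [5.1])]
print(unbalanced_factor_local(balanced))  # 1.0
```

This mirrors the RDD pipeline step by step (Counter plays the role of map + reduceByKey), so it is handy for verifying the Spark version on a small sample.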

