MiniBatchKMeans gives different centroids after subsequent iterations

Problem description

I am using the MiniBatchKMeans model from the sklearn.cluster module in Anaconda. I am clustering a dataset that contains approximately 75,000 points. It looks something like this:

data = np.array([8,3,1,17,5,21,1,7,1,26,323,16,2334,4,2,67,30,2936,2,16,12,28,1,4,190...])

I fit the data using the process below.

from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(batch_size=100)
kmeans.fit(data.reshape(-1, 1))

This is all well and good, and I proceed to find the centroids of the data:

centroids = kmeans.cluster_centers_
print(centroids)

This gives me the following output:

array([[ 13.09716569], [ 2908.30379747], [ 46.05089228], [ 725.83453237], [ 95.39868475], [ 1508.38356164], [ 175.48099948], [ 350.76287263]])

But when I run the process again, using the same data, I get different values for the centroids, such as:

array([[ 29.63143489], [ 1766.7244898 ], [ 171.04417206], [ 2873.70454545], [ 70.05295277], [ 1074.50387597], [ 501.36134454], [ 8.30600975]])

Can anyone explain why this is?

Recommended answer

Read up on what mini-batch k-means is.

It will never even converge. Do one more iteration and the result will change again.
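
You can check that the run-to-run variation comes from random sampling and initialization rather than from your data: fixing the seed makes runs repeatable (a minimal sketch; random_state is scikit-learn's standard seeding parameter, not something from the original post):

from sklearn.cluster import MiniBatchKMeans

# With a fixed seed, the random batch sampling and initialization
# are deterministic, so both runs produce identical centroids.
# The result is still just one of many possible local solutions.
run_a = MiniBatchKMeans(batch_size=100, random_state=0).fit(data.reshape(-1, 1))
run_b = MiniBatchKMeans(batch_size=100, random_state=0).fit(data.reshape(-1, 1))
print(run_a.cluster_centers_)
print(run_b.cluster_centers_)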

It is designed for data sets so huge you cannot load them into memory at once. So you load one batch, pretend it is the full data set, and do one iteration. Repeat with the next batch. If your batches are large enough and random, the result will be "close enough" to be usable, though it is never optimal.
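
That streaming pattern looks roughly like this (a sketch under assumptions: partial_fit is scikit-learn's incremental-fitting API, and np.array_split stands in for reading batches of a too-large dataset from disk):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=8, random_state=0)

# Treat each slice as one batch that fits in memory; each
# partial_fit call performs a single mini-batch update step.
for batch in np.array_split(data.reshape(-1, 1), 750):
    kmeans.partial_fit(batch)

print(kmeans.cluster_centers_)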

So:

  1. The mini-batch results are even more random than regular k-means results; they change every iteration.
  2. If you can load your data into memory, don't use mini-batch. Instead use a fast k-means implementation (most are surprisingly slow); see the sketch after this list.
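
For the second point, a minimal in-memory version (a sketch; KMeans is scikit-learn's standard full-batch implementation, and n_clusters=8 is assumed to match the eight centroids shown above):

from sklearn.cluster import KMeans

# Full-batch k-means: every iteration sees all ~75,000 points,
# and a fixed random_state makes the result reproducible.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
kmeans.fit(data.reshape(-1, 1))
print(kmeans.cluster_centers_)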

P.S. On one-dimensional data, sort your data set and then use an algorithm that benefits from the sorting, instead of k-means.
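
For instance, on sorted 1-D data every optimal cluster is a contiguous segment, so exact k-means can be solved by dynamic programming over cut points. A sketch of that idea (not from the original answer; the O(k*n^2) loops are illustrative, and real implementations use further speedups):

import numpy as np

def kmeans_1d(x, k):
    # Exact 1-D k-means: on sorted data the optimal clusters are
    # contiguous segments, so choose k-1 cut points by dynamic
    # programming. O(k * n^2) as written.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))       # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))  # prefix sums of squares

    def seg_cost(i, j):
        # sum of squared deviations of x[i:j] from its segment mean
        seg = s[j] - s[i]
        return (s2[j] - s2[i]) - seg * seg / (j - i)

    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                v = cost[c - 1][i] + seg_cost(i, j)
                if v < cost[c][j]:
                    cost[c][j] = v
                    cut[c][j] = i

    # trace the cut points back to recover each segment's mean
    centroids, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        centroids.append(x[i:j].mean())
        j = i
    return sorted(centroids)

Unlike MiniBatchKMeans, kmeans_1d(data, 8) returns the same, globally optimal centroids on every run.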
