K-means clustering for multidimensional data


Question

Suppose a dataset has 440 objects and 8 attributes (the Wholesale customers dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers). How do we calculate centroids for such a dataset?

If I calculate the mean of the values in each row, will that be the centroid? And how do I plot the resulting clusters in Matlab?

Recommended answer

OK, first of all, each row of the dataset corresponds to a single example. You have 440 rows, so the dataset consists of 440 examples. Each column holds the values of one feature (or attribute, as you call it): column 1 contains the values of the feature Channel, column 2 the values of the feature Region, and so on.

K-Means

Now for K-Means clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters. The simplest way to initialise K-Means is then to randomly choose 3 examples from your dataset (that is, 3 rows drawn at random from the 440 rows you have). These 3 examples are your initial centroids.
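The random initialisation described above could be sketched as follows. The random matrix is only a stand-in so the snippet runs on its own, and the variable names idx and centroids are illustrative assumptions, not part of the original answer.

```matlab
% Stand-in data so the snippet runs by itself; replace with the real 440x8 matrix.
rng(0);                          % fix the seed so the draw is repeatable
X = rand(440, 8);

K = 3;                           % number of clusters chosen by the user
idx = randperm(size(X, 1), K);   % K distinct row indices out of 440
centroids = X(idx, :);           % K x 8 matrix, one initial centroid per row
```

Using randperm rather than independent random draws guarantees that the same row is not picked twice.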

You can think of your centroids as 3 bins: you want to put every example from the dataset into the closest bin. "Closest" is usually measured by the Euclidean distance; check the function norm in Matlab.

After the first round of putting all examples into the closest bin, you recalculate each centroid as the mean of all examples in its bin. You then repeat the two steps (assign every example to the closest bin, recalculate the centroids) until no example in your dataset moves to another bin.
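Put together, the assign-and-recalculate loop might look like the sketch below. It is a minimal illustration under stated assumptions, not a replacement for a tested implementation: the data is a random stand-in, the iteration cap maxIter and the handling of empty bins are choices made here, and the names labels and newLabels are hypothetical.

```matlab
rng(1);                                   % repeatable run
X = rand(440, 8);                         % stand-in for the real 440x8 data
K = 3;
n = size(X, 1);
centroids = X(randperm(n, K), :);         % random rows as initial centroids
labels = zeros(n, 1);                     % bin index of every example
maxIter = 100;                            % safety cap (an assumption)

for iter = 1:maxIter
    % Assignment step: put each example into the closest bin.
    newLabels = zeros(n, 1);
    for i = 1:n
        d = zeros(K, 1);
        for k = 1:K
            d(k) = norm(X(i, :) - centroids(k, :));
        end
        [~, newLabels(i)] = min(d);
    end

    % Stop once no example has moved to another bin.
    if isequal(newLabels, labels)
        break;
    end
    labels = newLabels;

    % Update step: each centroid becomes the mean of its bin.
    for k = 1:K
        members = X(labels == k, :);
        if ~isempty(members)              % keep the old centroid if a bin is empty
            centroids(k, :) = mean(members, 1);
        end
    end
end
```

The explicit loops are slow but mirror the description step by step; a vectorised distance computation would be the natural next refinement.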

Some Matlab starting points

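You load the data with X = load('path/to/the/dataset', '-ascii');. This assumes the file contains only numbers; a header line with column names would have to be skipped or removed first.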

In your case X will be a 440x8 matrix.

You can calculate the Euclidean distance from an example to a centroid with distance = norm(example - centroid1);, where both example and centroid1 have dimensionality 1x8.
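As a small check of what norm computes here, the following sketch compares it with the Euclidean formula written out by hand. The two 1x8 vectors are made-up values chosen so the result is easy to verify.

```matlab
example   = [1 2 3 4 5 6 7 8];      % made-up 1x8 example
centroid1 = [1 2 3 4 5 6 7 11];     % made-up 1x8 centroid, differs only in the last entry

distance = norm(example - centroid1);              % 2-norm of the difference vector
byHand   = sqrt(sum((example - centroid1) .^ 2));  % the same formula written out

% Both values equal 3, because the vectors differ by 3 in a single coordinate.
```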

Recalculating the centroids works as follows. Suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bins. Say Bin1 now contains all examples that are closest to centroid1 and has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1);. For a matrix, mean averages down each column, so the result is again 1x8: the centroid is the mean of each feature over the examples in the bin, not the mean of the values in each row. Writing mean(Bin1, 1) is safer, because it still averages over rows when a bin holds a single example. You would do the same for your other bins.
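The difference between the column-wise mean (a centroid) and the row-wise mean (what the question asks about) can be seen on a tiny made-up example. The matrix, the labels vector, and the name rowMeans are illustrative assumptions.

```matlab
X      = [1 2; 3 4; 10 10; 5 6];   % 4 examples with 2 features each
labels = [1; 1; 2; 1];             % bin index of each example

Bin1 = X(labels == 1, :);          % rows 1, 2 and 4, so Bin1 is 3x2

centroid1 = mean(Bin1, 1);         % mean over examples, per feature: [3 4]
rowMeans  = mean(Bin1, 2);         % mean over features, per example: [1.5; 3.5; 5.5]
```

Only centroid1 lives in the same 2-dimensional feature space as the examples; rowMeans collapses each example to one number and cannot serve as a centroid.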

As for plotting, note that your dataset contains 8 features, which means 8 dimensions, and that cannot be plotted directly. I'd suggest you create or look for a (dummy) dataset which consists of only 2 features and can therefore be visualised with Matlab's plot() function.
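For such a 2-feature dummy dataset, a plot of the clusters could be sketched like this. The two point clouds and their labels are generated here purely for illustration; in practice labels and centroids would come from your own K-Means run.

```matlab
rng(2);                                        % repeatable dummy data
X = [randn(30, 2); randn(30, 2) + 4];          % two clouds of 2-feature points
labels = [ones(30, 1); 2 * ones(30, 1)];       % assumed cluster of each point
centroids = [mean(X(labels == 1, :), 1);
             mean(X(labels == 2, :), 1)];      % 2x2, one centroid per row

colors = 'rb';                                 % one colour per cluster
figure; hold on;
for k = 1:2
    pts = X(labels == k, :);
    plot(pts(:, 1), pts(:, 2), [colors(k) '.']);          % cluster members
    plot(centroids(k, 1), centroids(k, 2), [colors(k) 'x'], ...
         'MarkerSize', 12, 'LineWidth', 2);               % cluster centroid
end
xlabel('feature 1'); ylabel('feature 2');
hold off;
```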
