K-means clustering for multidimensional data
Question
If the data set has 440 objects and 8 attributes (the dataset is taken from the UCI Machine Learning Repository), how do we calculate centroids for such a dataset? (Wholesale customers data: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers)
If I calculate the mean of the values of each row, will that be the centroid? And how do I plot the resulting clusters in Matlab?
Answer
OK, first of all, each row in the dataset corresponds to a single example, and you have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for one specific feature (or attribute, as you call it): e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region, and so on.
K-Means
Now for K-Means clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters; then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is, 3 rows drawn at random from the 440 rows you have) and use them as your initial centroids.
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by the Euclidean distance; see the function norm in Matlab).
After the first round of putting all examples into the closest bin, you recalculate each centroid as the mean, taken feature by feature, of all examples in its bin. You then repeat both steps (putting every example into the closest bin, then recalculating the centroids) until no example in your dataset moves to another bin.
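The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: it runs on small synthetic 2-feature data rather than the Wholesale customers file, and the names labels, newLabels, inBin and maxIter are made up for the example.

```matlab
% Synthetic stand-in data (assumption): 60 examples, 2 features, 3 blobs.
X = [randn(20, 2); randn(20, 2) + 10; randn(20, 2) - 10];
K = 3;
n = size(X, 1);

% Initialise: pick K distinct rows at random as the starting centroids.
centroids = X(randperm(n, K), :);

labels = zeros(n, 1);                % bin index of every example (0 = not assigned yet)
maxIter = 100;                       % safety cap, an arbitrary choice
for iter = 1:maxIter
    % Assignment step: put every example into the closest bin.
    newLabels = zeros(n, 1);
    for i = 1:n
        d = zeros(K, 1);
        for k = 1:K
            d(k) = norm(X(i, :) - centroids(k, :));
        end
        [~, newLabels(i)] = min(d);
    end

    % Stop once no example has moved to another bin.
    if isequal(newLabels, labels)
        break;
    end
    labels = newLabels;

    % Update step: each centroid becomes the mean of the examples in its bin.
    for k = 1:K
        inBin = (labels == k);
        if any(inBin)                % leave the centroid alone if its bin is empty
            centroids(k, :) = mean(X(inBin, :), 1);
        end
    end
end
```

The explicit loops mirror the description step by step; they are not the fastest way to write this, but they make it easy to see where the assignment and the update happen.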
Some Matlab starting points
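You can load the dataset via X = load('path/to/the/dataset', '-ascii'); (this expects purely numeric content, so a header line in the file would have to be removed first). In your case X will be a 440x8 matrix.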
You can calculate the Euclidean distance from an example to a centroid by
distance = norm(example - centroid1);
where both example and centroid1 have dimensionality 1x8.
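As a small worked illustration of the distance step, the sketch below measures one example against three centroids and picks the closest. All numbers are made up for the example, and the variable names centroids, distances and closest are assumptions, not part of the original answer.

```matlab
example   = [1 2 3 4 5 6 7 8];       % one made-up 1x8 example
centroids = [1 2 3 4 5 6 7 9;        % centroid1
             0 0 0 0 0 0 0 0;        % centroid2
             8 7 6 5 4 3 2 1];       % centroid3

K = size(centroids, 1);
distances = zeros(K, 1);
for k = 1:K
    % Euclidean distance between the example and the k-th centroid
    distances(k) = norm(example - centroids(k, :));
end

% Index of the closest centroid, i.e. the bin this example goes into
[~, closest] = min(distances);
```

Here the example differs from centroid1 only in the last feature, so distances(1) is 1 and closest is 1.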
Recalculating the centroids works as follows. Suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bins. Say Bin1 now contains all examples that are closest to centroid1 and therefore has dimensionality 127x8, which means that 127 of the 440 examples are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1, 1); which averages down each column and gives a 1x8 result (the explicit dimension 1 keeps this correct even if a bin holds a single example). So the centroid is the mean of each feature over the examples in the bin, not the mean of each row. You would do the same for your other bins.
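To make the direction of the averaging concrete, here is a tiny sketch with a made-up bin of two examples. The values and the names rowMeans, single, asScalar and asRow are illustrative assumptions.

```matlab
Bin1 = [1 2 3 4 5 6 7 8;
        3 4 5 6 7 8 9 10];           % made-up bin holding 2 examples with 8 features

% Mean down each column: one value per feature, a valid 1x8 centroid.
centroid1 = mean(Bin1, 1);           % [2 3 4 5 6 7 8 9]

% Mean along each row: one value per example, which is NOT a centroid.
rowMeans = mean(Bin1, 2);            % [4.5; 6.5]

% With a single example in the bin, mean() without a dimension
% averages along the row and collapses it to a scalar.
single   = [1 2 3 4 5 6 7 8];
asScalar = mean(single);             % 4.5
asRow    = mean(single, 1);          % the 1x8 row, unchanged
```

Passing the dimension explicitly is a small habit that avoids the single-row surprise.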
As for plotting, note that your dataset contains 8 features, i.e. 8 dimensions, which cannot be visualised directly. I'd suggest you create or look for a (dummy) dataset which consists of only 2 features and can therefore be visualised using Matlab's plot() function.
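A minimal plotting sketch for such a 2-feature dummy dataset might look like this. The data is synthetic, and labels stands in for whatever bin assignment your K-Means run produced; both are assumptions made for the illustration.

```matlab
% Dummy data (assumption): two blobs of 30 examples, 2 features each.
X = [randn(30, 2); randn(30, 2) + 6];
labels = [ones(30, 1); 2 * ones(30, 1)];   % stand-in for the K-Means result

% Centroid of each cluster: column-wise mean of its examples.
centroids = [mean(X(labels == 1, :), 1);
             mean(X(labels == 2, :), 1)];

figure;
hold on;
plot(X(labels == 1, 1), X(labels == 1, 2), 'r.');   % cluster 1 in red
plot(X(labels == 2, 1), X(labels == 2, 2), 'b.');   % cluster 2 in blue
plot(centroids(:, 1), centroids(:, 2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
xlabel('feature 1');
ylabel('feature 2');
legend('cluster 1', 'cluster 2', 'centroids');
hold off;
```

For the 8-feature data, the same pattern could be applied to any two columns of X at a time, with the caveat that each such plot shows only a 2-dimensional slice of the clusters.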