Is it right to normalize data and/or weight vectors in a SOM?


Question

So I am being stumped by something that should be simple:

I have written a SOM for a simple 'play' two-dimensional data set. Here is the data:

You can make out 3 clusters by yourself.

Now, there are two things that confuse me. The first is that the tutorial I have normalizes the data before the SOM gets to work on it. That is, it normalizes each data vector to have length 1 (Euclidean norm). If I do that, then the data looks like this:

(This is because all the data has been projected onto the unit circle.)

So, my questions are as follows:

1) Is this correct? Projecting the data down onto the unit circle seems to be bad, because you can no longer make out 3 clusters. Is this a fact of life for SOMs (i.e., that they only work on the unit circle)?

2) The second, related question is that not only are the data normalized to have length 1, but so are the weight vectors of each output unit after every iteration. I understand that this is done so that the weight vectors don't 'blow up', but it seems wrong to me, since the whole point of the weight vectors is to retain distance information. If you normalize them, you lose the ability to 'cluster' properly. For example, how can the SOM possibly distinguish the cluster on the lower left from the cluster on the upper right, since they project down to the unit circle the same way?

I am very confused by this. Should data be normalized to unit length in SOMs? Should the weight vectors be normalized as well?

Thanks!

EDIT

Here is the data, saved as a .mat file for MATLAB. It is a simple two-dimensional data set.

Recommended answer

Whether to normalize the input data depends on what the data represent. Say you are clustering two-dimensional (or three-dimensional) input data in which each data vector represents a spatial point: the first dimension is the x coordinate and the second is the y coordinate. In this case you don't normalize the input data, because the input features (the dimensions) are directly comparable with each other.

If you are again clustering in a two-dimensional space, but each input vector represents a person's age and annual income (the first feature is age, the second is annual income), then you must normalize the input features, because they represent different things (different units of measurement) on completely different scales. Consider these input vectors: D1(25, 30000), D2(50, 30000) and D3(25, 60000). Both D2 and D3 double one of the features compared to D1. Keep in mind that a SOM uses Euclidean distance. Distance(D1, D2) = 25 and Distance(D1, D3) = 30000. That is rather "unfair" to the first feature (age): although you doubled it, you get a much smaller distance than in the second case (D1, D3).
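The two distances quoted above can be checked with a few lines of Python. This is only a sketch; the helper name `euclidean` is made up for illustration and is not part of any SOM library.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (age, annual income) vectors from the example above
d1 = (25, 30000)
d2 = (50, 30000)   # age doubled
d3 = (25, 60000)   # income doubled

print(euclidean(d1, d2))  # 25.0
print(euclidean(d1, d3))  # 30000.0
```

On the raw values, the income axis dominates the distance completely, which is why a SOM trained on them would effectively ignore age.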

Check this; it has a similar example as well.

If you are going to normalize your input data, normalize each feature/dimension separately (each column of your input data table). Quoting from the som_normalize manual:

"Normalizations are always one-variable operations"

Check also this for a brief explanation of normalization, and if you want to read more, try this (chapter 7 is what you want).

The most common normalization methods are scaling each dimension to [0, 1], or transforming it to have zero mean and standard deviation 1. The first is done by subtracting from each input the min value of its dimension (column) and then dividing by the max value minus the min value (of that dimension).

Xi,norm = (Xi - Xmin) / (Xmax - Xmin)

Yi,norm = (Yi - Ymin) / (Ymax - Ymin)
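As a rough illustration of the min-max formulas, here is a small pure-Python sketch applied to the age/income vectors D1, D2 and D3 from earlier. The function name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(rows):
    """Scale every column of `rows` to [0, 1] independently."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            # Assumption: a constant column (hi == lo) is mapped to 0.0.
            (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ))
    return scaled

data = [(25, 30000), (50, 30000), (25, 60000)]  # D1, D2, D3
scaled = min_max_scale(data)
print(scaled)  # [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
```

After scaling, D2 and D3 are both at distance 1 from D1, so doubling the age counts as much as doubling the income.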

In the second method you subtract the mean value of each dimension and then divide by its standard deviation.

Xi,norm = (Xi - Xmean) / Xsd
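A comparable sketch for the zero-mean, unit-variance method, using only the Python standard library. The name `z_score_scale` is hypothetical. The answer does not say whether the population or the sample standard deviation is meant; the population form (`statistics.pstdev`) is assumed here.

```python
import statistics

def z_score_scale(rows):
    """Give every column of `rows` zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # Assumption: population standard deviation; use statistics.stdev
    # instead if the sample form is wanted.
    stds = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]

data = [(25, 30000), (50, 30000), (25, 60000)]
standardized = z_score_scale(data)
for row in standardized:
    print(row)
```

Each column of `standardized` then has mean 0 and standard deviation 1 (up to floating-point error), so both features contribute on the same scale.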

Each method has pros and cons. For example, the first method is very sensitive to outliers in the data. You should choose after you have examined the statistical characteristics of your dataset.
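The outlier sensitivity of min-max scaling is easy to see on a made-up column of values (the numbers below are invented purely for illustration):

```python
def min_max_column(values):
    """Min-max scale a single column of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Four typical values plus one outlier (illustrative data only).
values = [1, 2, 3, 4, 100]
scaled_values = min_max_column(values)
print(scaled_values)
```

The single outlier takes the value 1.0, while the four typical values are squeezed into roughly the bottom 3% of the range, so the differences between them almost disappear.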

Projecting onto the unit circle is not really a normalization method but more of a dimensionality reduction method, since after the projection you could replace each data point with a single number (e.g. its angle). You don't have to do this.
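To make the loss of information concrete, here is a minimal sketch with two invented points that lie on the same ray from the origin. The helper names `to_unit` and `angle` are hypothetical, and the points are assumed to be non-zero.

```python
import math

def to_unit(p):
    """Project a non-zero 2-D point onto the unit circle."""
    n = math.hypot(p[0], p[1])
    return (p[0] / n, p[1] / n)

def angle(p):
    """Angle of a 2-D point in radians, measured from the x axis."""
    return math.atan2(p[1], p[0])

near = (1.0, 1.0)   # close to the origin (illustrative point)
far = (5.0, 5.0)    # far from the origin, same direction

print(to_unit(near), to_unit(far))
print(math.isclose(angle(near), angle(far)))  # True
```

Both points end up at the same place on the unit circle and are fully described by one angle, so any cluster structure along the radial direction is gone after the projection.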
