Why can vector normalization improve the accuracy of clustering and classification?


Question

It is described in Mahout in Action that normalization can slightly improve the accuracy. Can anyone explain the reason? Thanks!

Answer

Normalization is not always required, but it rarely hurts.

Some examples:

K-means:

K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation leaving variances unequal is equivalent to putting more weight on variables with smaller variance.

Example in MATLAB:

% Two Gaussian blobs, both features on comparable scales
X = [randn(100,2)+ones(100,2);...
     randn(100,2)-ones(100,2)];

% Uncomment to put the second feature on a much larger scale;
% k-means then splits the data almost solely along that feature:
% X(:,2) = X(:,2) * 1000 + 500;

opts = statset('Display','final');

% Cluster into 2 groups with city-block distance, best of 5 restarts
[idx,ctrs] = kmeans(X,2,...
                    'Distance','city',...
                    'Replicates',5,...
                    'Options',opts);

plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(ctrs(:,1),ctrs(:,2),'kx',...
     'MarkerSize',12,'LineWidth',2)
plot(ctrs(:,1),ctrs(:,2),'ko',...
     'MarkerSize',12,'LineWidth',2)
legend('Cluster 1','Cluster 2','Centroids',...
       'Location','NW')
title('K-means with normalization')
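To see numerically why unequal scales hurt, here is a minimal Python sketch (standard library only; the feature values are illustrative, and the x * 1000 + 500 rescaling mirrors the commented-out MATLAB line): the small-scale feature is drowned out in Euclidean distance, and z-score standardization restores the balance.

```python
import math

def zscore(values):
    # Standardize to zero mean and unit variance (population std).
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Two features: x1 on unit scale, x2 = x1 * 1000 + 500,
# mimicking the commented-out rescaling line above.
x1 = [1.0, 1.2, -1.0, -1.1]
x2 = [v * 1000 + 500 for v in x1]

# Squared Euclidean distance between points 0 and 2, raw vs standardized.
raw = (x1[0] - x1[2]) ** 2 + (x2[0] - x2[2]) ** 2
z1, z2 = zscore(x1), zscore(x2)
std = (z1[0] - z1[2]) ** 2 + (z2[0] - z2[2]) ** 2

print(raw)  # ~4,000,004: x2 contributes essentially everything
print(std)  # both standardized features contribute comparably
```

Uncommenting the rescaling line in the MATLAB example produces exactly this situation: the distance is then dominated by the second coordinate, and the clusters follow it.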

(FYI) Distributed clustering:

The comparative analysis shows that the distributed clustering results depend on the type of normalization procedure.

Artificial neural networks (inputs):

If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
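The "rescaling can be undone by the weights" argument is easy to verify numerically. A small sketch for a single linear unit y = w * x + b (all values are made up for illustration):

```python
# One linear unit: y = w * x + b (illustrative values).
w, b = 0.7, 0.3
x = 2.0
y = w * x + b

# Rescale the input: x' = a * x + c.
a, c = 1000.0, 500.0
x_rescaled = a * x + c

# Adjusting the weight and bias exactly undoes the rescaling:
#   w' = w / a,  b' = b - w * c / a   =>   w' * x' + b' = w * x + b
w_adj = w / a
b_adj = b - w * c / a
y_rescaled = w_adj * x_rescaled + b_adj

print(y, y_rescaled)  # identical outputs
```

So in exact arithmetic nothing changes; the practical benefits cited above (faster training, fewer bad local optima) come from how gradient-based optimization behaves, not from the function class itself.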

Artificial neural networks (inputs/outputs):

Should you do any of these things to your data? The answer is, it depends.

Standardizing either input or target variables tends to make the training process better behaved by improving the numerical condition (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html) of the optimization problem and ensuring that various default values involved in initialization and termination are appropriate. Standardizing targets can also affect the objective function.

Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous.
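"Standardization of cases" means rescaling each sample (row) rather than each variable (column). A small sketch of the information this discards, using made-up term counts: two documents of very different lengths collapse onto the same unit vector.

```python
import math

def unit(vec):
    # Case normalization: scale one sample vector to unit length.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

short_doc = [2.0, 1.0]     # raw term counts (illustrative)
long_doc = [200.0, 100.0]  # a 100x longer document, same proportions

print(unit(short_doc))
print(unit(long_doc))  # identical: the length information is gone
```

Whether that is helpful (topic similarity regardless of document length) or disastrous (when magnitude itself carries the signal) is exactly the author's point.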


Interestingly, in some applications changing the measurement units may even lead one to see a very different clustering structure:

In some applications, changing the measurement units may even lead one to see a very different clustering structure. For example, the age (in years) and height (in centimeters) of four imaginary people are given in Table 3 and plotted in Figure 3. It appears that {A, B} and {C, D} are two well-separated clusters. On the other hand, when height is expressed in feet one obtains Table 4 and Figure 4, where the obvious clusters are now {A, C} and {B, D}. This partition is completely different from the first because each subject has received another companion. (Figure 4 would have been flattened even more if age had been measured in days.)
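The cluster flip can be reproduced with invented numbers in the spirit of that example (the values below are illustrative, not the book's Table 3/4 data):

```python
import math

# Four imaginary people: (age in years, height in cm).
# Invented values, not the book's tables.
people_cm = {'A': (35, 190), 'B': (40, 190), 'C': (35, 160), 'D': (40, 160)}
# The same people with height converted to feet (1 ft = 30.48 cm).
people_ft = {n: (age, h / 30.48) for n, (age, h) in people_cm.items()}

def nearest(name, data):
    # Closest other person by Euclidean distance.
    return min((math.dist(data[name], p), n)
               for n, p in data.items() if n != name)[1]

print(nearest('A', people_cm))  # 'B': the 30 cm height gap dominates
print(nearest('A', people_ft))  # 'C': in feet, the 5-year age gap dominates
```

With heights in centimeters the natural pairs are {A, B} and {C, D}; converting to feet shrinks the height axis and the pairs flip to {A, C} and {B, D}, just as in the quoted passage.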

To avoid this dependence on the choice of measurement units, one has the option of standardizing the data. This converts the original measurements to unitless variables.
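The "unitless" claim can be checked directly: z-scoring the same heights expressed in centimeters and in feet yields the same standardized values, because the unit cancels in (x - mean) / std. (A standard-library sketch; the heights are illustrative.)

```python
import math

def zscore(values):
    # (x - mean) / std, with the population standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

heights_cm = [190.0, 190.0, 160.0, 160.0]
heights_ft = [h / 30.48 for h in heights_cm]  # same heights in feet

print(zscore(heights_cm))  # [1.0, 1.0, -1.0, -1.0]
print(zscore(heights_ft))  # the same values: the measurement unit cancels
```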

Kaufman et al. continue with some interesting considerations (page 11):

From a philosophical point of view, standardization does not really solve the problem. Indeed, the choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985). On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.
