Effect of feature scaling on accuracy


Problem description


I am working on image classification using Gaussian Mixture Models. I have around 34,000 feature vectors, belonging to three classes, all lying in a 23-dimensional space. I performed feature scaling on both the training and testing data using different methods, and I observed that accuracy actually decreases after scaling. I performed feature scaling because many features differed by several orders of magnitude. I am curious to know why this is happening; I thought that feature scaling would increase the accuracy, especially given the large differences in features.

Solution

I thought that feature scaling would increase the accuracy, especially given the large differences in features.

Welcome to the real world, buddy.

In general, it is quite true that you want features to be on the same "scale" so that you don't have some features "dominating" others. This is especially so if your machine learning algorithm is inherently "geometrical" in nature. By "geometrical", I mean it treats the samples as points in a space and relies on the distances (usually Euclidean/L2, as in your case) between points in making its predictions, i.e., the spatial relationships of the points matter. GMM and SVM are algorithms of this nature.
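As a minimal numpy sketch (with made-up numbers) of how one large-scale feature can dominate the L2 distance these algorithms rely on:

```python
import numpy as np

# Two samples: feature 0 is yearly income (scale ~1e4), feature 1 is age (scale ~1e1).
a = np.array([50_000.0, 25.0])
b = np.array([52_000.0, 65.0])

# Raw Euclidean distance is dominated almost entirely by the income feature.
raw = np.linalg.norm(a - b)

# After standardizing each feature (made-up per-feature mean/std for this sketch),
# both features contribute on comparable scales.
mean = np.array([51_000.0, 45.0])
std = np.array([1_000.0, 20.0])
scaled = np.linalg.norm((a - mean) / std - (b - mean) / std)

print(raw)     # ~2000.4: essentially the income gap alone
print(scaled)  # ~2.83: both features now contribute equally
```

The 40-year age difference is invisible in the raw distance but accounts for half of the scaled one.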

However, feature scaling can screw things up, especially if some features are categorical/ordinal in nature and you didn't properly preprocess them when appending them to the rest of your features. Furthermore, depending on your feature scaling method, the presence of outliers in a particular feature can also screw up the scaling for that feature. E.g., "min/max" or "unit variance" scaling is going to be sensitive to outliers (say, if one of your features encodes yearly income or cash balance and there are a few mi/billionaires in your dataset).
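A toy numpy illustration of that outlier sensitivity, using hypothetical incomes: min/max scaling collapses all the ordinary values, while a median/IQR ("robust") scaling keeps them usable:

```python
import numpy as np

# Hypothetical yearly incomes with one billionaire outlier.
income = np.array([30_000.0, 45_000.0, 60_000.0, 80_000.0, 1_000_000_000.0])

# Min/max scaling: the outlier squashes every ordinary income toward 0.
minmax = (income - income.min()) / (income.max() - income.min())
print(minmax[:4])  # all ~0.00005 or less; ordinary incomes become indistinguishable

# A more robust alternative: center on the median and divide by the IQR.
q1, med, q3 = np.percentile(income, [25, 50, 75])
robust = (income - med) / (q3 - q1)
print(robust[:4])  # ordinary incomes keep a usable spread
```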

Also, when you experience a problem such as this, the cause may not be obvious. It does not follow that because you performed feature scaling and the result got worse, feature scaling is at fault. It could be that your method was screwed up to begin with, and the result after feature scaling just happens to be more screwed up.

So what could be other cause(s) of your problem?

  1. My guess for the most likely cause is that you have high-dimensional data and not enough training samples. This is because your GMM is going to be estimating covariance matrices using data that is 34000 in dimension. Unless you have a lot of data, chances are one or more of your covariance matrices (one for each Gaussian) are going to be near-singular or singular. This means the predictions from your GMM are nonsense to begin with, because your Gaussians "blew up" and/or the EM algorithm just gave up after a predefined number of iterations.
  2. Poor testing methodology. You did not divide your data into proper training/validation/test sets, and you did not perform the testing properly. Whatever "good" performance you had in the beginning was therefore not credible. This is actually very common, as the natural tendency is to test on the training data the model was fitted to, rather than on a validation or test set.
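Point 1 can be demonstrated with a small numpy sketch (toy dimensions, not the asker's actual data): with fewer samples than dimensions, the sample covariance matrix a GMM must estimate is necessarily rank-deficient, hence singular:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 samples in 50 dimensions: fewer samples than dimensions, so the
# sample covariance matrix cannot have full rank.
n, d = 20, 50
X = rng.standard_normal((n, d))
cov = np.cov(X, rowvar=False)  # d x d sample covariance

rank = np.linalg.matrix_rank(cov)
print(rank)  # at most n - 1 = 19, far below d = 50: the matrix is singular
```

A Gaussian with a singular covariance has no well-defined density, which is exactly the "blown up" situation described above.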

So what can you do?

  1. Don't use a GMM for image categorization. Use a proper supervised learning algorithm, especially since you have known image categories as labels. In particular, to avoid feature scaling altogether, use a random forest or one of its variants (e.g., extremely randomized trees).
  2. Get more training data. Unless you are classifying "simple" (i.e., "toy"/synthetic) images, or classifying them into only a few image classes (e.g., <= 5; note this is just a random small number I pulled out of the air), you really need a good deal of images per class. A good starting point is at least a couple of hundred per class; alternatively, use a more sophisticated algorithm that exploits the structure within your data to arrive at better performance.
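To illustrate why tree-based methods make scaling unnecessary (point 1), here is a toy single-feature decision stump in numpy; the data and threshold rule are made up for this sketch. Rescaling the feature rescales every candidate threshold with it, so the achievable splits, and hence the accuracy, are unchanged:

```python
import numpy as np

def best_stump_accuracy(x, y):
    """Best training accuracy of a threshold rule 'predict 1 if x > t'."""
    best = 0.0
    for t in x:  # candidate thresholds at the observed values
        pred = (x > t).astype(int)
        acc = max((pred == y).mean(), ((1 - pred) == y).mean())
        best = max(best, acc)
    return best

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = (x > 0.3).astype(int)  # perfectly separable toy labels

# Rescaling the feature (e.g., meters -> centimeters) does not change
# which samples fall on each side of the corresponding threshold.
acc_raw = best_stump_accuracy(x, y)
acc_scaled = best_stump_accuracy(100 * x, y)
print(acc_raw, acc_scaled)  # identical
```

Real random forests stack many such threshold splits, so the same scale-invariance carries over.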

Basically, my point is: don't (just) treat the machine learning field/algorithms as black boxes and a bunch of tricks to memorize and try at random. Try to understand the algorithms and the math under the hood. That way, you'll be better able to diagnose the problems you encounter.


EDIT (in response to request for clarification by @Zee)

For papers, the only one I can recall off the top of my head is A Practical Guide to Support Vector Classification by the authors of LibSVM. Examples therein show the importance of feature scaling for SVMs on various datasets. E.g., consider the RBF/Gaussian kernel: this kernel uses the squared L2 norm, so if your features are on different scales, this will affect the kernel value.
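A small numpy sketch of that point (made-up feature values): the RBF kernel k(u, v) = exp(-gamma * ||u - v||^2) is driven almost entirely by whichever feature has the largest scale:

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    # RBF/Gaussian kernel: k(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Two samples; feature 0 on a large scale, feature 1 on a small one.
a = np.array([500.0, 0.2])
b = np.array([510.0, 0.9])

# The squared distance, and hence the kernel value, is dominated by feature 0.
k_raw = rbf(a, b)            # exp(-100.49): numerically ~0, feature 1 is invisible
scale = np.array([100.0, 1.0])
k_scaled = rbf(a / scale, b / scale)  # exp(-0.5) ~ 0.61: both features matter
print(k_raw, k_scaled)
```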

Also, how you represent your features matters. E.g., changing a variable that represents height from meters to cm or inches will affect algorithms such as PCA (because the variance along the direction of that feature has changed). Note this is different from "typical" scaling (e.g., min/max, Z-score, etc.) in that it is a matter of representation: the person is still the same height regardless of the unit, whereas typical feature scaling "transforms" the data, which changes the "height" of the person. Prof. David MacKay, on the Amazon page of his book Information Theory, Inference, and Learning Algorithms, has a comment in this vein when asked why he did not include PCA in his book.
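A numpy sketch of the representation point, on synthetic heights and weights: converting height from meters to centimeters multiplies its variance by 10,000 and flips which direction PCA considers dominant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
height_m = rng.normal(1.7, 0.1, n)  # heights in meters
weight = rng.normal(70.0, 5.0, n)   # a second feature, e.g. weight in kg

def leading_pc(X):
    """Direction of largest variance (first principal component)."""
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return vecs[:, -1]                # eigenvector of the largest eigenvalue

# In meters, height has tiny variance (~0.01); the first PC is essentially weight.
pc_m = leading_pc(np.column_stack([height_m, weight]))
# In centimeters the same physical data has variance ~100 on height,
# so the first PC flips to the height axis.
pc_cm = leading_pc(np.column_stack([height_m * 100, weight]))

print(np.abs(pc_m))   # ~[0, 1]: dominated by weight
print(np.abs(pc_cm))  # ~[1, 0]: dominated by height
```

The people are the same; only the unit changed, yet PCA "sees" a different dominant direction.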

For ordinal and categorical variables, they are mentioned briefly in Bayesian Reasoning and Machine Learning and The Elements of Statistical Learning. They mention ways to encode them as features, e.g., replacing a variable that can take 3 categories with 3 binary variables, with one set to "1" to indicate the sample has that category. This is important for methods such as linear regression (or linear classifiers). Note this is about encoding categorical variables/features, not scaling per se, but they are part of the feature preprocessing setup and hence useful to know. More can be found in Hal Daumé III's book below.
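A minimal sketch of that encoding, with a hypothetical 3-category "color" feature:

```python
# One-hot encoding of a 3-category variable: each category becomes its own
# binary column, so a linear model is not forced to treat the arbitrary
# integer codes 0, 1, 2 as ordered magnitudes.
categories = ["red", "green", "blue"]
samples = ["green", "red", "green", "blue"]

one_hot = [[1 if s == c else 0 for c in categories] for s in samples]
print(one_hot)
# [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```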

The book A Course in Machine Learning by Hal Daumé III. Search for "scaling". One of the earliest examples in the book is how it affects KNN (which just uses L2 distance, which GMM, SVM, etc. also use if you use the RBF/Gaussian kernel). More details are given in Chapter 4, "Machine Learning in Practice". Unfortunately the images/plots are not shown in the PDF. This book has one of the nicest treatments of feature encoding and scaling, especially if you work on Natural Language Processing (NLP). E.g., see his explanation of applying the logarithm to features (i.e., a log transform). That way, sums of logs become the log of a product of features, and the "effects"/"contributions" of these features are tapered by the logarithm.
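His log-transform point can be checked directly on toy values:

```python
import math

# Log transform: the sum of log-features equals the log of their product,
# and large raw values contribute with diminishing ("tapered") effect.
a, b = 1000.0, 10.0
assert math.isclose(math.log(a) + math.log(b), math.log(a * b))

# A 100x larger raw value adds only a constant log(100) to the feature,
# instead of dominating it multiplicatively.
print(math.log(a) - math.log(b))  # log(100) ~ 4.61
```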

Note that all the aforementioned textbooks are freely downloadable from the above links.

