How to approach machine learning problems with high dimensional input space?


Question

How should I approach a situation where I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high-dimensional input, and the results I get are not quite satisfactory?

1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea how to approach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.

Answer

What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase the classification accuracy of the SVM.
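
As a rough illustration of that kind of experiment, here is a minimal sketch assuming scikit-learn and a synthetic dataset (the feature counts, component count, and kernel settings are placeholders, not the ones from the experiment above):

```python
# Compare an RBF-kernel SVM on raw (scaled) high-dimensional features
# versus the same SVM on PCA-reduced features, using cross-validation.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a high-dimensional problem (500 features).
X, y = make_classification(n_samples=1000, n_features=500,
                           n_informative=50, random_state=0)

svm_raw = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=50),
                        SVC(kernel="rbf"))

print("raw features:", cross_val_score(svm_raw, X, y, cv=5).mean())
print("PCA-reduced :", cross_val_score(svm_pca, X, y, cv=5).mean())
```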

To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax, where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m: a separating hyperplane w·y + b = 0 in R^n pulls back to the hyperplane (A^T w)·x + b = 0 in R^m, since w·(Ax) = (A^T w)·x. Therefore, the classes should be "at least" as separable in the original space as in their lower-dimensional projections, i.e., PCA should not help, in theory.
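
A tiny numerical check of that pull-back argument, assuming NumPy (the dimensions and random data are arbitrary):

```python
# Verify that a linear separator w·y + b in the projected space R^n
# induces the separator (A^T w)·x + b in the original space R^m.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5                      # original and projected dimensions
A = rng.standard_normal((n, m))   # projection matrix, y = A x
w = rng.standard_normal(n)        # hyperplane normal in R^n
b = 0.7
x = rng.standard_normal(m)        # an arbitrary point in R^m

score_projected = w @ (A @ x) + b      # decision value computed in R^n
score_original = (A.T @ w) @ x + b     # same value computed in R^m
assert np.isclose(score_projected, score_original)
print(score_projected, score_original)
```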

Here is one discussion that debates the use of PCA before SVM: link

What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm FAQ, particularly this entry link, contains more helpful tips. Among them (a short sketch follows the list):

  1. Scale your features before classification.
  2. Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
  3. Check the SVM parameters. Try many combinations to arrive at the best one.
  4. Use the RBF kernel first. It almost always works best (computationally speaking).
  5. Almost forgot... before testing, cross validate!
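
A minimal sketch of those tips, assuming scikit-learn (which wraps libsvm); the dataset, parameter grid, and values are illustrative only:

```python
# Tips 1-5: scale features, balance classes, search C and gamma,
# use the RBF kernel, and cross-validate before testing.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=200,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                          # tip 1
    ("svm", SVC(kernel="rbf", class_weight="balanced")),  # tips 2 and 4
])
param_grid = {                                            # tip 3
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1e-3, 1e-2, 1e-1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)             # tip 5
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```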

Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract the mean and divide by the standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
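
That "diagonal scaling" is just per-feature standardization; a minimal sketch assuming NumPy (equivalent in effect to scikit-learn's StandardScaler):

```python
# Per-feature standardization: subtract the mean and divide by the
# standard deviation of each column (feature).
import numpy as np

def diagonal_scale(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # avoid division by zero for constant features
    return (X - mean) / std

X = np.random.default_rng(0).standard_normal((100, 20)) * 5 + 3
X_scaled = diagonal_scale(X)
print(X_scaled.mean(axis=0).round(6))  # ~0 per feature
print(X_scaled.std(axis=0).round(6))   # ~1 per feature
```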
