How to approach machine learning problems with high dimensional input space?

Question

How should I approach a situation where I try to apply some ML algorithm (classification; SVM, in particular) to some high-dimensional input, and the results I get are not quite satisfactory?

1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea how to approach the problem. Once the data is over 3 dimensions, I am not really sure how to attack it, other than intuitively playing around with the parameters.

Answer

What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
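
As a rough illustration of that kind of experiment, here is a minimal sketch using scikit-learn (whose SVC wraps libsvm). The synthetic dataset and the choice of 20 PCA components are placeholder assumptions, not the answerer's actual setup:

# Sketch: compare SVM cross-validated accuracy on raw standardized
# features vs. PCA-reduced features. Dataset and sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=20, random_state=0)

raw = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pca = make_pipeline(StandardScaler(), PCA(n_components=20),
                    SVC(kernel="rbf"))

print("raw :", cross_val_score(raw, X, y, cv=5).mean())
print("PCA :", cross_val_score(pca, X, y, cv=5).mean())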

To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the classes in the original space should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
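
To spell out that reasoning step (in the notation above; w and b denote an assumed hyperplane separating Y1 and Y2 in R^n): if w . y >= b for all y in Y1 and w . y < b for all y in Y2, then since

    w . y = w . (Ax) = (A^T w) . x,

the vector A^T w with the same threshold b separates X1 from X2 in R^m. The converse does not hold: a projection can collapse separable classes together, which is why reducing dimension can hurt linear separability but never improve it.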

Here is one discussion that debates the use of PCA before SVM: link

What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm FAQ, particularly this entry link, contains more helpful tips. Among them (a short tuning sketch follows the list):

  1. Scale your features before classification.
  2. Try to obtain balanced classes. If that is not possible, penalize one class more than the other. See more references on SVM class imbalance.
  3. Check the SVM parameters. Try many combinations to arrive at the best one.
  4. Use the RBF kernel first. It almost always works best (computationally speaking).
  5. Almost forgot... before testing, cross validate!
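
To make tips 1, 3 and 5 concrete, here is a minimal sketch using scikit-learn's SVC (a wrapper around libsvm). The grid values are illustrative placeholders, not recommended settings:

# Sketch: scale features, then run a cross-validated grid search over
# C and gamma for an RBF-kernel SVM. class_weight="balanced" is one
# way to penalize the minority class more (tip 2).
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(kernel="rbf", class_weight="balanced"))])

grid = {"svm__C": [0.1, 1, 10, 100],
        "svm__gamma": [1e-3, 1e-2, 1e-1, 1]}

search = GridSearchCV(pipe, grid, cv=5)   # 5-fold cross-validation
# search.fit(X_train, y_train)            # X_train, y_train: your data
# print(search.best_params_, search.best_score_)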

Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four proprietary data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract the mean and divide by the standard deviation) performed better. I'm not drawing any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
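
For completeness, that "simple diagonal scaling" is just per-feature standardization; a minimal NumPy sketch (the scikit-learn shortcut is noted in the comment):

# Per-feature standardization ("diagonal scaling"): subtract the mean,
# divide by the standard deviation, one feature (column) at a time.
import numpy as np

def diagonal_scale(X):
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X - mu) / sigma

# Equivalent via scikit-learn:
# from sklearn.preprocessing import StandardScaler
# X_scaled = StandardScaler().fit_transform(X)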
