Why did PCA reduce the performance of Logistic Regression?

Question

I performed logistic regression on a binary classification problem with data of dimensions 50000 × 370 and got an accuracy of about 90%. But when I applied PCA + logistic regression to the data, my accuracy dropped to 10%. I was shocked to see this result. Can anybody explain what could have gone wrong?
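For reference, a minimal sketch of the kind of pipeline described, using scikit-learn and a synthetic stand-in for the real 50000 × 370 dataset (the library and the data generator are assumptions, not from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 50000 x 370 dataset from the question.
X, y = make_classification(n_samples=50000, n_features=370,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain logistic regression.
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(X_train, y_train)
print("LR accuracy:", lr.score(X_test, y_test))

# PCA + logistic regression. Using a single pipeline guarantees that
# the PCA fitted on the training set is the one applied to the test set.
pca_lr = make_pipeline(StandardScaler(), PCA(n_components=50),
                       LogisticRegression(max_iter=1000))
pca_lr.fit(X_train, y_train)
print("PCA + LR accuracy:", pca_lr.score(X_test, y_test))
```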

Accepted Answer

There is no guarantee that PCA will help, or that it will not harm, the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, so anything can happen: if the removed information was redundant, you will probably get better scores; if it was an important part of the problem, you will get worse ones. Even without dropping dimensions, merely "rotating" the input space through PCA can either benefit or harm the process. One must remember that PCA is just a heuristic when it comes to supervised learning.

The only guarantee PCA gives is that each consecutive dimension explains less and less variance, and that it is the best affine transformation in terms of variance explained by the first K dimensions. That's all. This can be completely unrelated to the actual problem, because PCA does not consider labels at all. Given any dataset, PCA will transform it in a way that depends only on the positions of the points, so for some labelings (those consistent with the general shape of the data) it might help, while for many others (more complex patterns of labels) it will destroy previously detectable relations. Furthermore, since PCA changes the scaling of some directions, you might need different hyperparameters for your classifier, such as the regularization strength for LR.
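A toy illustration of this label-blindness (hypothetical code, not part of the original answer): when the class-discriminative direction happens to carry little variance, PCA's top component keeps the high-variance noise and discards the signal:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(0, 10.0, size=n),     # pure noise, variance ~100
    y + rng.normal(0, 0.2, size=n),  # class signal, variance ~0.3
])

lr = LogisticRegression().fit(X, y)
print("LR on raw data:", lr.score(X, y))      # high accuracy

# PCA ignores y: the first component aligns with the noise axis,
# so projecting to 1 dimension destroys the separable structure.
X1 = PCA(n_components=1).fit_transform(X)
lr1 = LogisticRegression().fit(X1, y)
print("LR after PCA(1):", lr1.score(X1, y))   # near chance (~0.5)
```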

Now, getting back to your problem: I would say that in your case the problem is... a bug in your code. On a binary problem, you cannot legitimately drop significantly below 50% accuracy. 10% accuracy means that using the opposite of your classifier would give 90% (just answer "false" whenever it says "true" and the other way around). So even though PCA might not help (or might even harm, as described), in your case it is surely an error in your code.
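As a quick sanity check of that arithmetic (a minimal sketch, not taken from the question's code), flipping every prediction of a 10%-accuracy binary classifier yields 90%:

```python
import numpy as np

# Worse-than-chance accuracy on a binary task usually signals inverted
# predictions or a data-handling bug, not a genuinely bad model.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_pred = 1 - y_true        # pathological: every prediction wrong...
y_pred[0] = y_true[0]      # ...except one, giving 10% accuracy

acc = (y_pred == y_true).mean()
acc_flipped = ((1 - y_pred) == y_true).mean()
print(acc, acc_flipped)    # 0.1 0.9
```

One bug worth checking (an assumption, not a diagnosis of the question's code) is applying the PCA step inconsistently, for example fitting a separate PCA on the test set: the sign of each principal component is arbitrary, so independently fitted projections can come out flipped relative to training, which inverts the classifier's decisions.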
