Why did PCA reduce the performance of Logistic Regression?


Problem description


I performed logistic regression on a binary classification problem with data of dimensions 50000 x 370 and got an accuracy of about 90%. But when I ran PCA + logistic regression on the same data, my accuracy dropped to 10%. I was very shocked to see this result. Can anybody explain what could have gone wrong?
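
For reference, a minimal sketch of the kind of pipeline being described (the original code was not posted, so every name and parameter below is an assumption; scikit-learn's make_classification stands in for the real 50000 x 370 data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real 50000 x 370 matrix.
X, y = make_classification(n_samples=50000, n_features=370, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain logistic regression.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("LR alone:", lr.score(X_test, y_test))

# PCA + logistic regression. Keeping both steps in one pipeline ensures
# PCA is fit on the training split only and the identical transform is
# applied to the test split.
pca_lr = make_pipeline(StandardScaler(), PCA(n_components=100),
                       LogisticRegression(max_iter=1000))
pca_lr.fit(X_train, y_train)
print("PCA + LR:", pca_lr.score(X_test, y_test))
```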

Solution

There is no guarantee that PCA will ever help, or that it will not harm, the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, so anything can happen: if the removed part was redundant, you will probably get better scores; if it was an important part of the problem, you will get worse ones. Even without dropping dimensions, merely "rotating" the input space through PCA can either benefit or harm the process. One must remember that, when it comes to supervised learning, PCA is just a heuristic. The only guarantee PCA gives is that each consecutive dimension explains less and less of the variance, and that it is the best affine transformation in terms of the variance explained by the first K dimensions. That's all. This can be completely unrelated to the actual problem, as PCA does not consider labels at all. Given any dataset, PCA will transform it in a way that depends only on the positions of the points, so for some labelings (those consistent with the general shape of the data) it might help, while for many others (more complex label patterns) it will destroy previously detectable relations. Furthermore, since PCA changes the scaling of some directions, you may need different hyperparameters for your classifier, such as the regularization strength for LR.
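
To make the "PCA does not consider labels" point concrete, here is a toy illustration (my own example, not from the original answer; the scales are chosen to exaggerate the effect). The direction that separates the two classes carries almost no variance, so projecting onto the first principal component throws it away and accuracy collapses to chance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, n)
noise = rng.normal(scale=10.0, size=n)    # variance ~100, carries no signal
signal = (2 * y - 1) * 0.1                # variance ~0.01, fully determines y
X = np.column_stack([noise, signal])

full = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# The first principal component aligns with the high-variance noise axis,
# so the discriminative direction is dropped entirely.
Z = PCA(n_components=1).fit_transform(X)
reduced = LogisticRegression(max_iter=1000).fit(Z, y).score(Z, y)

print(f"all features: {full:.2f}, first PC only: {reduced:.2f}")
# Typically prints something like: all features: 1.00, first PC only: 0.50
```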

Now, getting back to your problem: in your case, I would say the problem is... a bug in your code. You cannot honestly drop significantly below 50% accuracy on a binary problem. 10% accuracy means that using the opposite of your classifier would give 90% (just answer "false" whenever it says "true" and the other way around). So even though PCA might not help (or might even harm, as described above), in your case it is surely an error in your code.
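
The inversion argument is easy to check numerically (an illustrative snippet, not part of the original answer). In practice, accuracy this far below chance often means the rows of the transformed matrix are no longer aligned with the label vector, e.g. one of them was shuffled or re-indexed along the way:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_pred = 1 - y_true          # a "10%-accurate" classifier is mostly inverted
y_pred[0] = y_true[0]        # leave a single prediction correct -> 10%

print(np.mean(y_pred == y_true))        # 0.1
print(np.mean((1 - y_pred) == y_true))  # 0.9 after flipping every prediction
```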
