Correlated features and classification accuracy


Question


I'd like to ask how correlated features (variables) affect the classification accuracy of machine learning algorithms. By correlated features I mean features that are correlated with each other, not with the target class (e.g., the perimeter and the area of a geometric figure, or level of education and average income). In my opinion, correlated features negatively affect the accuracy of a classification algorithm, because the correlation makes one of them useless. Is this really the case? Does the problem change with the type of classification algorithm? Any suggestions for papers and lectures are very welcome! Thanks

Recommended answer


Correlated features do not affect classification accuracy per se. The problem in realistic situations is that we have a finite number of training examples with which to train a classifier. For a fixed number of training examples, increasing the number of features typically increases classification accuracy up to a point, but as the number of features continues to increase, classification accuracy will eventually decrease because we are then undersampled relative to the large number of features. To learn more about the implications of this, look at the curse of dimensionality.
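
The rise-then-fall behaviour described above can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original answer: the data are synthetic Gaussians, the classifier is a simple nearest-centroid rule, the extra features are pure noise, and all names (make_point, mean_accuracy, and so on) are hypothetical. The training set is held at 10 examples per class while the number of features grows.

```python
import random

def make_point(label, n_informative, n_noise, rng):
    """Draw one sample: informative features shift with the class, noise features do not."""
    shift = 0.5 if label == 1 else -0.5
    informative = [rng.gauss(shift, 1.0) for _ in range(n_informative)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
    return informative + noise

def centroid(points):
    """Per-feature mean of a list of samples."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_accuracy(n_informative, n_noise, n_train=10, n_test=200, trials=10, seed=0):
    """Average test accuracy of a nearest-centroid classifier over several random trials."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(trials):
        # Estimate one centroid per class from a small, fixed-size training set.
        centroids = {}
        for label in (0, 1):
            train = [make_point(label, n_informative, n_noise, rng) for _ in range(n_train)]
            centroids[label] = centroid(train)
        # Classify fresh test points by the closest estimated centroid.
        for label in (0, 1):
            for _ in range(n_test):
                x = make_point(label, n_informative, n_noise, rng)
                predicted = min(centroids, key=lambda c: sq_dist(x, centroids[c]))
                correct += predicted == label
                total += 1
    return correct / total

results = {
    "1 informative": mean_accuracy(1, 0),
    "5 informative": mean_accuracy(5, 0),
    "5 informative + 200 noise": mean_accuracy(5, 200),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

In this setup, going from 1 to 5 informative features should raise accuracy, while adding 200 uninformative features on top should lower it again, because the centroids estimated from only 10 examples per class become dominated by noise.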


If two numerical features are perfectly correlated, then one of them adds no information beyond the other (it is determined by the other). So if the number of features is too high relative to the training sample size, it is beneficial to reduce the number of features through a feature extraction technique (e.g., via principal components).
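
As a small illustration of a perfectly correlated feature carrying no extra information, the sketch below uses a made-up data set of squares, where the perimeter is exactly 4 times the side length. The data, the labels, and the choice of a 1-nearest-neighbour classifier are assumptions for illustration only; the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def predict_1nn(train_X, train_y, x):
    """Return the label of the training sample closest to x (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[nearest]

# Hypothetical training data: side length of a square, labelled small (0) or large (1).
sides = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
perimeters = [4.0 * s for s in sides]  # fully determined by the side length

r = pearson(sides, perimeters)

# Train once on the side alone, once on side plus the redundant perimeter.
X_one = [[s] for s in sides]
X_two = [[s, p] for s, p in zip(sides, perimeters)]

queries = [1.2, 4.0, 5.2, 9.0]
pred_one = [predict_1nn(X_one, labels, [q]) for q in queries]
pred_two = [predict_1nn(X_two, labels, [q, 4.0 * q]) for q in queries]

print("correlation:", r)
print("side only:        ", pred_one)
print("side + perimeter: ", pred_two)
```

Here the redundant feature only rescales all distances by a constant factor, so the nearest neighbour, and therefore every prediction, is the same with or without it. Other classifiers may react differently to redundant features.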


The effect of correlation does depend on the type of classifier. Some nonparametric classifiers are less sensitive to correlation of variables (although training time will likely increase with an increase in the number of features). For statistical methods such as Gaussian maximum likelihood, having too many correlated features relative to the training sample size will render the classifier unusable in the original feature space (the covariance matrix of the sample data becomes singular).
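
The singular covariance matrix mentioned above can be checked directly on toy data. The following is a minimal sketch using exact rational arithmetic from the standard library; the data sets and helper names (sample_covariance, matrix_rank) are made up for illustration. A covariance matrix whose rank is below the number of features has no inverse, which a Gaussian maximum likelihood classifier would need.

```python
from fractions import Fraction

def sample_covariance(rows):
    """Sample covariance matrix (features in columns), computed exactly with Fractions."""
    n, d = len(rows), len(rows[0])
    means = [sum(Fraction(r[j]) for r in rows) / n for j in range(d)]
    centered = [[Fraction(r[j]) - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matrix_rank(matrix):
    """Rank via Gaussian elimination; exact because the entries are Fractions."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((r for r in range(found, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, n_rows):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found

# Case 1: two perfectly correlated features (perimeter = 4 * side), plenty of samples.
correlated = [[s, 4 * s] for s in (1, 2, 3, 6, 7, 8)]
# Case 2: four features but only three samples (undersampled).
undersampled = [[1, 0, 2, 5], [0, 3, 1, 1], [4, 1, 0, 2]]
# Case 3: two features that are not linearly dependent, four samples.
healthy = [[1, 2], [2, 1], [3, 5], [4, 3]]

rank_correlated = matrix_rank(sample_covariance(correlated))
rank_undersampled = matrix_rank(sample_covariance(undersampled))
rank_healthy = matrix_rank(sample_covariance(healthy))

print("perfectly correlated:", rank_correlated, "of 2")
print("undersampled:        ", rank_undersampled, "of 4")
print("healthy:             ", rank_healthy, "of 2")
```

In the first two cases the rank falls short of the number of features, so the covariance matrix is singular. With n samples the sample covariance has rank at most n - 1, which is why too many features relative to the sample size causes the same problem as exact correlation.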

