Feature selection using scikit-learn


Problem description

I'm new to machine learning. I'm preparing my data for classification using the Scikit-Learn SVM. To select the best features, I used the following method:

SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset consists of negative values, I get the following error:

ValueError                                Traceback (most recent call last)

/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 
      2 
      3 
      4 
      5 

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)

--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X)
    191     if np.any((X.data if issparse(X) else X) < 0):
--> 192         raise ValueError("Input X must be non-negative.")
    193 
    194     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

Recommended answer

The error message Input X must be non-negative says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. That is logical, because the chi-squared test assumes a frequency distribution, and a frequency can't be negative. Consequently, sklearn.feature_selection.chi2 asserts that its input is non-negative.

You are saying that your features are the "min, max, mean, median and FFT of accelerometer signal". In many cases it may be quite safe to simply shift each feature so that it becomes entirely positive, or even to normalize it to the [0, 1] interval, as suggested by EdChum.

If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features.

Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
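As one concrete alternative (an assumption on my part, not something the answer names explicitly), scikit-learn's built-in f_classif scorer, which is based on the ANOVA F-value, accepts negative inputs, so the data can be used as-is. Again using synthetic stand-ins for A1 and A2:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
A1 = rng.randn(100, 20)           # toy feature matrix containing negative values
A2 = rng.randint(0, 2, size=100)  # toy binary class labels

# f_classif scores features with the ANOVA F-value and, unlike chi2,
# does not require non-negative input, so no shifting or scaling is needed
A1_new = SelectKBest(f_classif, k=10).fit_transform(A1, A2)
print(A1_new.shape)  # (100, 10)
```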

