scikit learn: desired amount of Best Features (k) not selected

Problem description

I am trying to select the best features using chi-square (scikit-learn 0.10). From a total of 80 training documents I first extract 227 features, and from these 227 features I want to select the top 10.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)
# Clip the counts to 0/1 so each feature is a binary presence indicator
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)
ch2 = SelectKBest(chi2, k=10)
print(X_train.shape)
X_train = ch2.fit_transform(X_train, Y_train)
print(X_train.shape)

The result is as follows:

(80, 227)
(80, 14)

The shapes before and after selection are the same if I set k equal to 100:

(80, 227)
(80, 227)

Why does this happen?

A full output example, now without clipping, where I request 30 and get 32 instead:

Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True]
get support with vocabulary :
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...

Another example without clipping, where I request 10 and get 11 instead:

Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support:
[ True  True  True False False  True False False False False  True False
 False False  True False False False  True False  True False  True  True
 False False False False  True False False False]
get support with vocabulary :
[ 0  1  2  5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...

Recommended answer

Have you checked what is returned from the get_support() function (ch2 should have this member function)? It returns the indices selected among the best k.
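
For concreteness, a minimal sketch of inspecting the fitted selector (assuming ch2 was fitted as in the question's code):

mask = ch2.get_support()             # boolean mask over the input feature columns
idx = ch2.get_support(indices=True)  # integer indices of the kept features
print(len(idx))                      # how many features were actually kept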

My conjecture is that there are ties due to the data clipping that you're doing (or due to repeated feature vectors, if your feature vectors are categorical and are likely to have repeats), and that the scikits function returns all entries that are tied for the top k spots. The extra example where you set k = 100 casts some doubt on this conjecture, but it's worth a look.

See what get_support() returns, and check what X_train looks like at those indices; see whether clipping results in a lot of feature overlap, creating ties in the chi^2 p-value ranks that SelectKBest uses.
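
One way to check for such ties (a diagnostic sketch; chi2 here is sklearn's scoring function, applied to the same X_train and Y_train as in the question):

import numpy as np
from sklearn.feature_selection import chi2

# chi2 returns one (score, p-value) pair per feature column
scores, pvalues = chi2(X_train, Y_train)

# Inspect the p-values around the requested cut-off: if several features
# share the p-value at the k-th position, the ranking cannot separate them
k = 10
print(np.sort(pvalues)[:k + 5])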

If this turns out to be the case, you should file a bug / issue with scikits.learn, because currently their documentation does not say what SelectKBest will do in the event of ties. Clearly it can't just take some of the tied indices and not others, but users should at least be warned that ties could result in unexpected feature dimensionality reduction.
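
In the meantime, if exactly k features are needed, one workaround is to rank the features yourself and truncate, breaking ties arbitrarily (a sketch, not part of the original answer; it reuses X_train, X_test, and Y_train from the question):

import numpy as np
from sklearn.feature_selection import chi2

scores, pvalues = chi2(X_train, Y_train)
top_k = np.argsort(pvalues)[:10]  # indices of the 10 smallest p-values
X_train = X_train[:, top_k]
X_test = X_test[:, top_k]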
