scikit learn: desired amount of Best Features (k) not selected
Question
I am trying to select the best features using chi-square (scikit-learn 0.10). From a total of 80 training documents I first extract 227 features, and from these 227 features I want to select the top 10.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())  # MyAnalyzer is my custom analyzer
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)
ch2 = SelectKBest(chi2, k=10)
print(X_train.shape)
X_train = ch2.fit_transform(X_train, Y_train)
print(X_train.shape)
These are the results:
(80, 227)
(80, 14)
They are similar if I set k equal to 100:
(80, 227)
(80, 227)
Why is this happening?
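(As an aside on the clipping step: clamping counts to 0/1 can leave several columns exactly identical, and identical columns necessarily receive identical chi-square statistics. A small sketch for counting exact duplicate columns — the matrix here is made up, and np.unique with an axis argument requires NumPy >= 1.13:)

```python
import numpy as np

# toy clipped 0/1 matrix; columns 0 and 1 are identical, as are 2 and 3
X = np.array([[0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 1, 1]])

# unique rows of X.T are the distinct columns of X
n_distinct = np.unique(X.T, axis=0).shape[0]
print(X.shape[1] - n_distinct)  # number of redundant (duplicated) columns -> 2
```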
* A full output example, now without clipping, where I request 30 and get 32 instead:
Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
[0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
[0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support:
[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True]
get support with vocabulary :
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
scale_C)
Classifying...
Another example without clipping, where I request 10 and get 11 instead:
Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
[0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
[0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support:
[ True True True False False True False False False False True False
False False True False False False True False True False True True
False False False False True False False False]
get support with vocabulary :
[ 0 1 2 5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
scale_C)
Classifying...
Answer
Have you checked what is returned from the get_support() function (ch2 should have this member function)? This returns the indices selected among the best k.
My conjecture is that there are ties due to the data clipping you're doing (or due to repeated feature vectors, if your feature vectors are categorical and likely to have repeats), and that the scikits function returns all entries that are tied for the top k spots. The extra example where you set k = 100 casts some doubt on this conjecture, but it's worth a look.
See what get_support() returns, and check what X_train looks like on those indices; see whether clipping results in a lot of feature overlap, creating ties in the chi^2 p-value ranks that SelectKBest is using.
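To see how such ties arise, here is a toy example (the matrix is invented): duplicated columns are guaranteed to receive identical chi^2 statistics and p-values, so no ranking can separate them.

```python
import numpy as np
from sklearn.feature_selection import chi2

# toy 0/1 matrix in which columns 0-2 are exact duplicates
X = np.array([[1, 1, 1, 0, 1],
              [0, 0, 0, 1, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [1, 1, 1, 1, 0]], dtype=np.float64)
y = np.array([0, 0, 1, 1, 1])

scores, pvals = chi2(X, y)
print(pvals[0] == pvals[1] == pvals[2])  # True: a three-way tie in rank
```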
If this turns out to be the case, you should file a bug / issue with scikits.learn, because currently their documentation does not say what SelectKBest will do in the event of ties. Clearly it can't just take some of the tied indices and leave others, but users should at least be warned that ties could result in unexpected feature dimensionality reduction.