Problems obtaining most informative features with scikit learn?


Question

I'm trying to obtain the most informative features from a textual corpus. From this well-answered question I know that this can be done as follows:

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

Then:

most_informative_feature_for_class(tfidf_vect, clf, 5)

For this classifier:

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values


from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
                                                    y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

The problem is the output of most_informative_feature_for_class:

5 a_base_de_bien bastante   (0, 2451)   -0.210683496368
  (0, 3533) -0.173621065386
  (0, 8034) -0.135543062425
  (0, 10346)    -0.173621065386
  (0, 15231)    -0.154148294738
  (0, 18261)    -0.158890483047
  (0, 21083)    -0.297476572586
  (0, 434)  -0.0596263855375
  (0, 446)  -0.0753492277856
  (0, 769)  -0.0753492277856
  (0, 1118) -0.0753492277856
  (0, 1439) -0.0753492277856
  (0, 1605) -0.0753492277856
  (0, 1755) -0.0637950312345
  (0, 3504) -0.0753492277856
  (0, 3511) -0.115802483001
  (0, 4382) -0.0668983049212
  (0, 5247) -0.315713152154
  (0, 5396) -0.0753492277856
  (0, 5753) -0.0716096348446
  (0, 6507) -0.130661516772
  (0, 7978) -0.0753492277856
  (0, 8296) -0.144739048504
  (0, 8740) -0.0753492277856
  (0, 8906) -0.0753492277856
  : :
  (0, 23282)    0.418623443832
  (0, 4100) 0.385906085143
  (0, 15735)    0.207958503155
  (0, 16620)    0.385906085143
  (0, 19974)    0.0936828782325
  (0, 20304)    0.385906085143
  (0, 21721)    0.385906085143
  (0, 22308)    0.301270427482
  (0, 14903)    0.314164150621
  (0, 16904)    0.0653764031957
  (0, 20805)    0.0597723455204
  (0, 21878)    0.403750815828
  (0, 22582)    0.0226150073272
  (0, 6532) 0.525138162099
  (0, 6670) 0.525138162099
  (0, 10341)    0.525138162099
  (0, 13627)    0.278332617058
  (0, 1600) 0.326774799211
  (0, 2074) 0.310556919237
  (0, 5262) 0.176400451433
  (0, 6373) 0.290124806858
  (0, 8593) 0.290124806858
  (0, 12002)    0.282832270298
  (0, 15008)    0.290124806858
  (0, 19207)    0.326774799211

It returns neither the label nor the words. Why is this happening, and how can I print the words and the labels? Do you think this is happening because I am using pandas to read the data? Another thing I tried is the following, from this question:

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))


print_top10(tfidf_vect,clf,y)

But I get this traceback:

Traceback (most recent call last):

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
    print_top10(tfidf_vect,clf,5)
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
    for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable

Any idea how to solve this, in order to get the features with the highest coefficient values?

Recommended answer

To solve this specifically for a linear SVM, we first have to understand the formulation of the SVM in sklearn and how it differs from MultinomialNB.

The reason most_informative_feature_for_class works for MultinomialNB is that coef_ there is essentially the log probability of each feature given a class, and hence has shape [n_classes, n_features], which follows from the formulation of the naive Bayes problem. But if we check the documentation for SVM, coef_ is not that simple. Instead, coef_ for a (linear) SVM has shape [n_classes * (n_classes - 1) / 2, n_features], because one binary model is fitted for every pair of classes (one-vs-one).
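The shape difference described above can be checked directly. The following is a minimal sketch, assuming a recent scikit-learn; the four-document corpus and its labels are made up purely for illustration, and feature_log_prob_ is used as the MultinomialNB attribute that holds the per-class log probabilities.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy corpus, made up for illustration: one document per class.
docs = [
    "zivot je bio dobar za sve",
    "la vida fue buena para todos",
    "a vida foi boa para todas",
    "zivot je bio lep za sve nas",
]
labels = ["bs", "es", "pt", "sr"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse document-term matrix

mnb = MultinomialNB().fit(X, labels)
svc = SVC(kernel="linear", C=1).fit(X, labels)

n_classes = len(svc.classes_)
n_features = X.shape[1]

# Naive Bayes: one row of log probabilities per class.
nb_shape = mnb.feature_log_prob_.shape
# Linear SVC: one row per pair of classes (one-vs-one).
svc_shape = svc.coef_.shape
# Fitted on sparse input, the SVC coefficients are themselves sparse,
# which is why zipping them with feature names prints "(0, 2451) ..." entries.
svc_coef_is_sparse = issparse(svc.coef_)

print(nb_shape, svc_shape, svc_coef_is_sparse)
```

With four classes the SVM therefore has six coefficient rows, so indexing coef_ by the position of a class label in classes_ does not select "the row for that class".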

If we know which particular coefficient vector we're interested in, we can alter the function to look like the following:

def most_informative_feature_for_class_svm(vectorizer, classifier, labelid, n=10):
    # labelid is the row of coef_ we're interested in (one row per pair of classes).
    feature_names = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
    svm_coef = classifier.coef_.toarray()  # coef_ is sparse when the SVM was fitted on sparse input
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(feat, coef)

This would work as intended and print the top n features, with their coefficients, for the coefficient vector that you're after.
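One way to decide which row to use for labelid is to enumerate the class pairs. This is a sketch under the assumption that the rows of coef_ follow the one-vs-one order given in the scikit-learn SVM user guide ("0 vs 1", "0 vs 2", ..., "1 vs 2", ...) over the sorted classes_; the corpus and the pair_to_row helper are hypothetical.

```python
from itertools import combinations

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy corpus, made up for illustration: one document per class.
docs = [
    "zivot je bio dobar za sve",
    "la vida fue buena para todos",
    "a vida foi boa para todas",
    "zivot je bio lep za sve nas",
]
labels = ["bs", "es", "pt", "sr"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
svc = SVC(kernel="linear", C=1).fit(X, labels)

# classes_ is sorted; the pairs are assumed to follow the documented
# "0 vs 1", "0 vs 2", ..., "1 vs 2", ... order of the one-vs-one models.
classes = [str(c) for c in svc.classes_]
pairs = list(combinations(classes, 2))
pair_to_row = {pair: row for row, pair in enumerate(pairs)}

for pair, row in pair_to_row.items():
    print(row, "%s vs %s" % pair)
```

Under that assumption, the row for a given pair of classes can be looked up as pair_to_row[("es", "pt")] and passed as labelid.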

As for getting the correct output for a particular class, that depends on your assumptions and on what you aim to output. I suggest reading through the multi-class section of the SVM documentation to get a feel for what you're after.

So using the train.txt file described in this question, we can get some kind of output, though in this situation it isn't particularly descriptive or easy to interpret. Hopefully this helps you.

import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

trainfile = 'train.txt'

# Vectorizing data (one document per line of train.txt).
word_vectorizer = CountVectorizer(analyzer='word')
with codecs.open(trainfile, 'r', 'utf8') as f:
    trainset = word_vectorizer.fit_transform(f)
tags = ['bs', 'pt', 'es', 'sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Training a linear SVM
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names_out()
    # MultinomialNB.coef_ was removed in scikit-learn 1.1; feature_log_prob_
    # holds the same per-class log probabilities.
    topn = sorted(zip(classifier.feature_log_prob_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(classlabel, feat, coef)

def most_informative_feature_for_class_svm(vectorizer, classifier, n=10):
    labelid = 3  # this is the coef we're interested in.
    feature_names = vectorizer.get_feature_names_out()
    svm_coef = classifier.coef_.toarray()  # coef_ is sparse for sparse training data
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(feat, coef)

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
print()
most_informative_feature_for_class_svm(word_vectorizer, svcc)

With output:

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

no 0.0204081632653
parecer 0.0204081632653
pone 0.0204081632653
por 0.0204081632653
relación 0.0204081632653
una 0.0204081632653
visto 0.0204081632653
ya 0.0204081632653
es 0.0408163265306
lo 0.0408163265306
