Check skills of a classifier in scikit-learn

Problem description

After training a classifier, I tried passing in a few sentences to check whether it classifies them correctly.

During that testing, the results do not look right.

I suspect some variables are not correct.

Explanation

I have a dataframe called df that looks like this:

                                              news        type
0   From: mathew <mathew@mantis.co.uk>\n Subject: ...   alt.atheism
1   From: mathew <mathew@mantis.co.uk>\n Subject: ...   alt.space
2   From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...   alt.tech
                                                            ...
#each row in the news column is a document
#each row in the type column is the category of that document

Preprocessing:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn import metrics

# Build TF-IDF features from the training documents
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df.news)

# Train an RBF-kernel SVM on the document categories
clf = SVC(C=10, gamma=1, kernel='rbf')
clf.fit(vectors, df.type)

# Predict on the test set (df_test is not shown in the question)
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)

Trying to check the classification of some sentences:

texts = ["The space shuttle is made in 2018", 
         "stars are shining",
         "galaxy"]
text_features = vectorizer.transform(texts)
predictions = clf.predict(text_features)
for text, predicted in zip(texts, predictions):
   print('"{}"'.format(text))
   print("  - Predicted as: '{}'".format(df.type[pred]))

   print("")

The problem is that it returns the following:

"The space shuttle is made in 2018"
  - Predicted as: 'alt.atheism    NaN
alt.atheism    NaN
alt.atheism    NaN
alt.atheism    NaN
alt.atheism    NaN

What do you think?

Example

This is how the output should look:

>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)

>>> predicted = clf.predict(X_new_tfidf)

>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
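
The pattern above looks up `target_names[category]` because that example's targets are integer indices. A minimal sketch of the same loop for the setup in this question, assuming the classifier was fitted on string labels as in `clf.fit(vectors, df.type)`: in that case `clf.predict` already returns the label strings, so each `predicted` value can be printed directly instead of being used to index `df.type`. The tiny corpus and its labels below are made up for illustration and stand in for `df.news` and `df.type`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training data standing in for df.news / df.type (illustrative only)
train_texts = [
    "the space shuttle launched into orbit",
    "astronauts aboard the space station",
    "god and religion and faith",
    "atheism and belief in god",
]
train_labels = ["alt.space", "alt.space", "alt.atheism", "alt.atheism"]

vectorizer = TfidfVectorizer(stop_words="english")
clf = SVC(C=10, gamma=1, kernel="rbf")
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

texts = ["The space shuttle is made in 2018", "faith in god"]
predictions = clf.predict(vectorizer.transform(texts))

# The classifier was fitted on string labels, so each prediction is already
# a category name and needs no further lookup.
for text, predicted in zip(texts, predictions):
    print('"{}"'.format(text))
    print("  - Predicted as: '{}'".format(predicted))
    print("")
```

Indexing `df.type` with the whole `pred` array, as in the question's loop, asks pandas for rows whose index equals each predicted label, which is one plausible source of the repeated `NaN` entries in the output shown earlier.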

Recommended answer

The data you trained your classifier on is significantly different from the phrases you test it on. As you mentioned in your comment on my first answer, you get an accuracy of more than 90%, which is pretty good. But you taught your classifier to classify mailing list items, which are long documents with e-mail addresses in them. Your phrases, such as "The space shuttle is made in 2018", are pretty short and do not contain e-mail addresses. It's possible that your classifier uses those e-mail addresses to classify the documents, which would explain the good results. You can test whether that is really the case by removing the e-mail addresses from the data before training.
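
One way to run that check is to strip anything that looks like an e-mail address from each document before vectorizing. The following is a minimal sketch, not a definitive implementation: the helper name `strip_emails` and the regular expression are assumptions introduced here, and the pattern is a rough approximation rather than a complete address matcher.

```python
import re

# Rough pattern for strings that look like e-mail addresses
# (an approximation, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def strip_emails(text):
    """Replace anything resembling an e-mail address with a single space."""
    return EMAIL_RE.sub(" ", text)


# Sample header in the style of the documents in the question
sample = "From: mathew <mathew@mantis.co.uk>\n Subject: ..."
cleaned = strip_emails(sample)
print(cleaned)
```

With the dataframe from the question, the helper could be applied as `df.news.map(strip_emails)` before `vectorizer.fit_transform`, and in the same way to `df_test.news` before `vectorizer.transform`. If the accuracy drops noticeably after retraining, that would suggest the classifier was leaning on the addresses.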
