Classification is poor although term frequency is right

Problem description

I am using the function below to check the most frequent words per category, and then looking at how some sentences are classified. The results are surprisingly wrong:

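# The function
# Written against an older scikit-learn: recent releases replace
# get_feature_names() with get_feature_names_out() and expose the
# MultinomialNB weights as feature_log_prob_ rather than coef_.
import numpy as np

def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        # Indices of the 10 highest-weighted features for class i
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))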

# Using the function on the data
show_top10(clf, vectorizer, newsgroups_train.target_names)

# The results seem logical; the most frequent words per category are:
rec.autos: think know engine don new good just like cars car
rec.motorcycles: riding helmet don know ride bikes dod like just bike
sci.space: don earth think orbit launch moon just like nasa space

# Now, testing these sentences, we see that they are classified wrongly and
# not according to the most frequent words above

texts = ["The space shuttle is made in 2018", 
    "The car is noisy.",
    "bikes and helmets"]
text_features = vectorizer.transform(texts)
predictions = clf.predict(text_features)
for text, predicted in zip(texts, predictions):
   print('"{}"'.format(text))
   print("  - Predicted as: '{}'".format(cats[predicted]))
   print("")

And the results are:

"The space shuttle is made in 2018"
  - Predicted as: 'rec.motorcycles'

"The car is noisy."
  - Predicted as: 'sci.space'

"bikes and helmets"
  - Predicted as: 'rec.autos'

This is completely wrong.

The classification code is shown below, in case it is needed.

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics


cats = ['sci.space','rec.autos','rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train',
                           remove=('headers', 'footers', 'quotes'), categories = cats)
newsgroups_test = fetch_20newsgroups(subset='test',
                           remove=('headers', 'footers', 'quotes'), categories = cats)

vectorizer = TfidfVectorizer(max_features = 1000,max_df = 0.5,
                            min_df = 5, stop_words='english')


vectors = vectorizer.fit_transform(newsgroups_train.data)

vectors_test = vectorizer.transform(newsgroups_test.data)

clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)


Recommended answer

The order of the names in the cats variable differs from the order in newsgroups_train.target_names. The names in target_names are sorted, and the integer labels the classifier predicts are indices into that sorted list, not into cats.

Output:

print(cats)
['sci.space', 'rec.autos', 'rec.motorcycles']

print(newsgroups_train.target_names)
['rec.autos', 'rec.motorcycles', 'sci.space']
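
To make the mismatch concrete, the following sketch compares the two orderings index by index. It assumes, as the answer states, that target_names is the sorted category list, so sorted(cats) is used as a stand-in for newsgroups_train.target_names and no dataset download is needed.

```python
# Category order as written by the user in the question.
cats = ['sci.space', 'rec.autos', 'rec.motorcycles']

# Stand-in for newsgroups_train.target_names
# (assumption: the names are sorted alphabetically).
target_names = sorted(cats)

# Show what each integer label means under each list.
for index, (from_cats, from_target_names) in enumerate(zip(cats, target_names)):
    status = "same" if from_cats == from_target_names else "DIFFERENT"
    print(index, from_cats, from_target_names, status)

# Indices at which looking up a label in cats gives the wrong name.
mismatched = [i for i in range(len(cats)) if cats[i] != target_names[i]]
```

With these three categories every index maps to a different name in the two lists, which would explain why all three test sentences appeared to be misclassified.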

You should change the line:

print("  - Predicted as: '{}'".format(cats[predicted]))

to:

print("  - Predicted as: '{}'".format(newsgroups_train.target_names[predicted]))
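
As a check, the misleading output in the question can be decoded after the fact. The sketch below recovers each integer label from the printed name via cats.index and then looks it up in the sorted list. Here sorted(cats) is an assumed stand-in for newsgroups_train.target_names, and the printed dictionary simply restates the output shown in the question.

```python
# Category order as written by the user in the question.
cats = ['sci.space', 'rec.autos', 'rec.motorcycles']

# Assumed stand-in for newsgroups_train.target_names (sorted names).
target_names = sorted(cats)

# Names printed by the original, incorrect lookup cats[predicted].
printed = {
    "The space shuttle is made in 2018": "rec.motorcycles",
    "The car is noisy.": "sci.space",
    "bikes and helmets": "rec.autos",
}

corrected = {}
for text, wrong_name in printed.items():
    label = cats.index(wrong_name)         # integer the classifier actually returned
    corrected[text] = target_names[label]  # name that integer really stands for
    print('"{}" -> {}'.format(text, corrected[text]))
```

Under that assumption the three sentences decode to sci.space, rec.autos and rec.motorcycles respectively, which suggests the classifier itself was behaving sensibly and only the label lookup was wrong.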
