How do I classify documents with SciKitLearn using TfIdfVectorizer?

Question

The following example shows how one can train a classifier with the Sklearn 20 newsgroups data.

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train',
...                                       categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)

However, I have my own labeled corpus that I would like to use.

After getting a tfidfvector of my own data, would I train a classifier like this?

classif_nb = nltk.NaiveBayesClassifier.train(vectorizer)

To recap: How can I use my own corpus instead of the 20newsgroups, but in the same way used here? How can I then use my TFIDFVectorized corpus to train a classifier?

Thanks!

Recommended answer

To address the questions from the comments: the basic process for working with a tf-idf representation in a classification task is:

  1. You fit the vectorizer to your training data and save it in some variable; let's call it tfidf
  2. You transform the training data (without labels, just text) through data = tfidf.transform(...)
  3. You fit the model (classifier) using some_classifier.fit(data, labels), where labels are in the same order as the documents in data
  4. During testing, you use tfidf.transform(...) on new data and check the predictions of your model
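
The four steps above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the tiny corpus, the label names, and the choice of scikit-learn's MultinomialNB as the classifier are all assumptions made for the example. Any scikit-learn classifier that accepts sparse input could stand in for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled corpus: texts and labels are parallel lists,
# so train_labels[i] is the label of train_texts[i].
train_texts = [
    "the rocket launched into orbit around the moon",
    "astronauts docked with the space station",
    "the gpu renders polygons and textures quickly",
    "shading and rendering a 3d image on screen",
]
train_labels = ["space", "space", "graphics", "graphics"]

# Step 1: fit the vectorizer on the training texts only.
tfidf = TfidfVectorizer()
tfidf.fit(train_texts)

# Step 2: transform the training texts into a sparse tf-idf matrix.
data = tfidf.transform(train_texts)

# Step 3: fit a classifier on the matrix and the labels.
# MultinomialNB is an assumed choice for this sketch.
classifier = MultinomialNB()
classifier.fit(data, train_labels)

# Step 4: reuse the already fitted vectorizer on new text, then predict.
test_texts = [
    "a new rocket will orbit the moon",
    "rendering textures on the gpu",
]
test_data = tfidf.transform(test_texts)
predictions = classifier.predict(test_data)
print(predictions)
```

Steps 1 and 2 can be combined with tfidf.fit_transform(train_texts), as in the question's 20 newsgroups snippet. The vectorizer itself is not training data: the classifier is fitted on the matrix it produces together with the labels, and test data must only be passed through transform so that it shares the training vocabulary.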
