Choosing an sklearn pipeline for classifying user text data

Question

I'm working on a machine learning application in Python (using the sklearn module), and am currently trying to decide on a model for performing inference. A brief description of the problem:

Given many instances of user data, I'm trying to classify them into various categories based on relative keyword containment. It is supervised, so I have many, many instances of pre-classified data that are already categorized. (Each piece of data is between 2 and 12 or so words.)

I am currently trying to decide between two potential models:

  1. CountVectorizer + Multinomial Naive Bayes. Use sklearn's CountVectorizer to obtain keyword counts across the training data. Then, use Naive Bayes to classify data using sklearn's MultinomialNB model.

  2. Tf-idf term weighting on keyword counts + standard Naive Bayes. Obtain a keyword count matrix for the training data using CountVectorizer, transform that data to be tf-idf weighted using sklearn's TfidfTransformer, and then dump that into a standard Naive Bayes model. (Both pipelines are sketched just after this list.)
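For concreteness, here is a minimal sketch of the two candidates as sklearn Pipeline objects. It assumes MultinomialNB for both (sklearn's multinomial model accepts tf-idf weights directly), and texts/labels are placeholders for the pre-classified data:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Option 1: raw keyword counts straight into multinomial Naive Bayes.
    count_nb = Pipeline([
        ("counts", CountVectorizer()),
        ("nb", MultinomialNB()),
    ])

    # Option 2: the same counts, reweighted by tf-idf before Naive Bayes.
    tfidf_nb = Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("nb", MultinomialNB()),
    ])

    # texts: list of short user strings; labels: their categories.
    # count_nb.fit(texts, labels)
    # tfidf_nb.fit(texts, labels)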

I've read through the documentation for the classes used in both methods, and both seem to address my problem very well.

For this type of problem, why might tf-idf weighting with a standard Naive Bayes model outperform multinomial Naive Bayes? Are there any obvious problems with either approach?

Answer

Naive Bayes and MultinomialNB are the same algorithm. The difference you get comes from the tf-idf transformation, which penalises words that occur in many documents across your corpus.

My advice: use tf-idf and tune the sublinear_tf, binary, and normalization (norm) parameters of TfidfVectorizer for your features.
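As a sketch of those knobs (TfidfVectorizer bundles CountVectorizer and TfidfTransformer into a single step; the values below are illustrative starting points, not recommendations):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        sublinear_tf=True,  # replace raw term frequency tf with 1 + log(tf)
        binary=False,       # if True, all non-zero term counts become 1
        norm="l2",          # per-document normalization: "l1", "l2", or None
    )
    # X_train = vectorizer.fit_transform(texts)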

Also try all kinds of different classifiers available in scikit-learn, which I suspect will give you better results if you properly tune the regularization type (penalty, either l1 or l2) and the regularization parameter (alpha).

If you tune them properly, I suspect you can get much better results using SGDClassifier with 'log' loss (logistic regression) or 'hinge' loss (SVM).
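For example (parameter values here are illustrative; note that scikit-learn 1.1+ spells the logistic loss "log_loss" rather than "log"):

    from sklearn.linear_model import SGDClassifier

    # loss="hinge" gives a linear SVM; loss="log" (or "log_loss" in
    # scikit-learn >= 1.1) gives logistic regression.
    clf = SGDClassifier(
        loss="hinge",
        penalty="l2",  # regularization type: "l1" or "l2"
        alpha=1e-4,    # regularization strength, worth tuning
    )
    # clf.fit(X_train, labels)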

The way people usually tune the parameters is through the GridSearchCV class in scikit-learn.
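A minimal sketch of what that usually looks like over a full pipeline, using the step__parameter naming convention to address each step (the grid values are illustrative, not a recommended search space):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", SGDClassifier(loss="hinge")),
    ])

    param_grid = {
        "tfidf__sublinear_tf": [True, False],
        "tfidf__binary": [True, False],
        "clf__penalty": ["l1", "l2"],
        "clf__alpha": [1e-5, 1e-4, 1e-3],
    }

    # 5-fold cross-validation over every parameter combination.
    search = GridSearchCV(pipe, param_grid, cv=5)
    # search.fit(texts, labels)
    # print(search.best_params_, search.best_score_)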
