Improve flow Python classifier and combine features

Problem Description

I am trying to create a classifier to categorize websites. I am doing this for the very first time so it's all quite new to me. Currently I am trying to do some Bag of Words on a couple of parts of the web page (e.g. title, text, headings). It looks like this:

from sklearn.feature_extraction.text import CountVectorizer
countvect_text = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_title = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_headings = CountVectorizer(encoding="cp1252", stop_words="english")

X_tr_text_counts = countvect_text.fit_transform(tr_data['text'])
X_tr_title_counts = countvect_title.fit_transform(tr_data['title'])
X_tr_headings_counts = countvect_headings.fit_transform(tr_data['headings'])

from sklearn.feature_extraction.text import TfidfTransformer

transformer_text = TfidfTransformer(use_idf=True)
transformer_title = TfidfTransformer(use_idf=True)
transformer_headings = TfidfTransformer(use_idf=True)

X_tr_text_tfidf = transformer_text.fit_transform(X_tr_text_counts)
X_tr_title_tfidf = transformer_title.fit_transform(X_tr_title_counts)
X_tr_headings_tfidf = transformer_headings.fit_transform(X_tr_headings_counts)

from sklearn.naive_bayes import MultinomialNB
text_nb = MultinomialNB().fit(X_tr_text_tfidf, tr_data['class'])
title_nb = MultinomialNB().fit(X_tr_title_tfidf, tr_data['class'])
headings_nb = MultinomialNB().fit(X_tr_headings_tfidf, tr_data['class'])

X_te_text_counts = countvect_text.transform(te_data['text'])
X_te_title_counts = countvect_title.transform(te_data['title'])
X_te_headings_counts = countvect_headings.transform(te_data['headings'])

X_te_text_tfidf = transformer_text.transform(X_te_text_counts)
X_te_title_tfidf = transformer_title.transform(X_te_title_counts)
X_te_headings_tfidf = transformer_headings.transform(X_te_headings_counts)

accuracy_text = text_nb.score(X_te_text_tfidf, te_data['class'])
accuracy_title = title_nb.score(X_te_title_tfidf, te_data['class'])
accuracy_headings = headings_nb.score(X_te_headings_tfidf, te_data['class'])

This works fine, and I get the accuracies as expected. However, as you might have guessed, this looks cluttered and is filled with duplication. My question then is, is there a way to write this more concisely?

Additionally, I am not sure how I can combine these three features into a single multinomial classifier. I tried passing a list of tfidf values to MultinomialNB().fit(), but apparently that's not allowed.

Optionally, it would also be nice to add weights to the features, so that in the final classifier some vectors have a higher importance than others.

I'm guessing I need a Pipeline, but I'm not at all sure how I should use it in this case.

Recommended Answer

First, CountVectorizer and TfidfTransformer can be replaced by TfidfVectorizer (which is essentially a combination of the two).
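
For example, the 'text' column from the question can be handled by a single vectorizer. A minimal sketch, reusing tr_data and te_data from the question:

from sklearn.feature_extraction.text import TfidfVectorizer

# One TfidfVectorizer replaces the CountVectorizer + TfidfTransformer pair
tfidf_text = TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)
X_tr_text_tfidf = tfidf_text.fit_transform(tr_data['text'])  # learn vocabulary and idf weights
X_te_text_tfidf = tfidf_text.transform(te_data['text'])      # reuse them on the test text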

Second, the TfidfVectorizer and MultinomialNB can be combined in a Pipeline. A pipeline sequentially applies a list of transforms followed by a final estimator. When fit() is called on a Pipeline, it fits all the transforms one after the other, transforming the data at each step, and then fits the final estimator on the transformed data. When score() or predict() is called, it only calls transform() on the transformers and then score() or predict() on the final estimator.

So the code becomes:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([('vectorizer', TfidfVectorizer(encoding="cp1252",
                                                    stop_words="english",
                                                    use_idf=True)),
                     ('nb', MultinomialNB())])

accuracy = {}
for item in ['text', 'title', 'headings']:

    # No need to save the return of fit(), it returns self
    pipeline.fit(tr_data[item], tr_data['class'])

    # Apply the transforms, then score with the final estimator
    accuracy[item] = pipeline.score(te_data[item], te_data['class'])

EDIT: Edited to include combining all of the features to get a single accuracy:

To combine the results, we can follow several approaches. One that is easy to understand (though it drifts back toward the cluttered side) is the following:

# Using scipy to concatenate, because TfidfVectorizer returns sparse matrices
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def get_tfidf(tr_data, te_data, columns):

    train = None
    test = None

    tfidf_vectorizer = TfidfVectorizer(encoding="cp1252",
                                       stop_words="english",
                                       use_idf=True)
    for item in columns:
        # Re-fit the vectorizer on each training column, then transform the
        # matching test column with that same vocabulary before moving on
        temp_train = tfidf_vectorizer.fit_transform(tr_data[item])
        train = hstack((train, temp_train)) if train is not None else temp_train

        temp_test = tfidf_vectorizer.transform(te_data[item])
        test = hstack((test, temp_test)) if test is not None else temp_test

    return train, test

train_tfidf, test_tfidf = get_tfidf(tr_data, te_data, ['text', 'title', 'headings'])

nb = MultinomialNB()
nb.fit(train_tfidf, tr_data['class'])
nb.score(test_tfidf, te_data['class'])

A second (and preferable) approach is to include all of this in a pipeline as well. But because the different columns ('text', 'title', 'headings') have to be selected and their results concatenated, it is not quite as straightforward. We need FeatureUnion for that; see the heterogeneous-data FeatureUnion examples in the scikit-learn documentation, and the sketch right after this paragraph:
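
A minimal sketch of that approach, assuming tr_data and te_data are pandas DataFrames with 'text', 'title', 'headings' and 'class' columns; the ColumnSelector helper and the transformer_weights values are only illustrative, but transformer_weights is also one way to give some feature blocks more importance, as asked above:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick a single text column out of the incoming DataFrame."""
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.column]

def text_branch(column):
    # One tf-idf branch per column; FeatureUnion concatenates their outputs
    return Pipeline([('select', ColumnSelector(column)),
                     ('tfidf', TfidfVectorizer(encoding="cp1252",
                                               stop_words="english",
                                               use_idf=True))])

combined = Pipeline([
    ('features', FeatureUnion(
        transformer_list=[('text', text_branch('text')),
                          ('title', text_branch('title')),
                          ('headings', text_branch('headings'))],
        # Optional per-feature weights (illustrative values only)
        transformer_weights={'text': 1.0, 'title': 0.5, 'headings': 0.5})),
    ('nb', MultinomialNB())])

combined.fit(tr_data, tr_data['class'])
combined.score(te_data, te_data['class'])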

Third, if you are open to using other libraries, DataFrameMapper from sklearn-pandas can simplify the FeatureUnion usage shown in the previous example.
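
A rough sketch of that route, assuming sklearn-pandas is installed and the same DataFrame layout as above:

from sklearn_pandas import DataFrameMapper
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# DataFrameMapper applies one transformer per column and concatenates the
# results, much like the FeatureUnion above but with less plumbing
mapper = DataFrameMapper([
    ('text',     TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('title',    TfidfVectorizer(encoding="cp1252", stop_words="english")),
    ('headings', TfidfVectorizer(encoding="cp1252", stop_words="english")),
], sparse=True)  # keep the stacked tf-idf output sparse

pipeline = Pipeline([('mapper', mapper), ('nb', MultinomialNB())])
pipeline.fit(tr_data, tr_data['class'])
pipeline.score(te_data, te_data['class'])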

If you do want to go the second or third route, feel free to reach out if you run into any difficulties.

NOTE: I have not checked the code, but it should work (barring minor syntax errors, if any). I will check it as soon as I'm at my PC.
