Merging CountVectorizer in Scikit-Learn feature extraction

Question

I am new to scikit-learn and needed some help with something that I have been working on.

I am trying to classify two types of documents (say, type A and type B) using Multinomial Naive Bayes classification. In order to get the term counts for these documents, I am using the CountVectorizer class in sklearn.feature_extraction.text.

The problem is that the two types of documents require different regular expressions to extract tokens (the token_pattern parameter of CountVectorizer). I can't seem to find a way to first load the training documents of type A and then those of type B. Is it possible to do something like:

vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...)
vecA.fit(list_of_type_A_document_content)
...
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...)
vecB.fit(list_of_type_B_document_content)
...
# Somehow merge the two vectorizers results and get the final sparse matrix
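
To make the difference between the two patterns concrete, here is a minimal sketch of how each one would tokenize the same text. The sample string is hypothetical, and all other CountVectorizer settings are assumed to be left at their defaults (which includes lowercasing).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample text; real documents would come from the two corpora.
sample = "Model X200 rev 3"

# build_analyzer() returns the callable CountVectorizer uses internally:
# preprocessing (lowercasing by default) followed by tokenization.
analyze_a = CountVectorizer(token_pattern="[a-zA-Z]+").build_analyzer()
analyze_b = CountVectorizer(token_pattern="[a-zA-Z0-9]+").build_analyzer()

tokens_a = analyze_a(sample)  # letters only, so digits split or drop tokens
tokens_b = analyze_b(sample)  # letters and digits stay together

print(tokens_a)
print(tokens_b)
```

With the letters-only pattern, "X200" is reduced to "x" and the lone "3" disappears, which is why a single shared pattern does not suit both document types.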

Recommended answer

You can try:

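from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# "..." stands for any other CountVectorizer parameters, as in the question
vecA = CountVectorizer(token_pattern="[a-zA-Z]+", ...)
vecA.fit_transform(list_of_type_A_document_content)
vecB = CountVectorizer(token_pattern="[a-zA-Z0-9]+", ...)
vecB.fit_transform(list_of_type_B_document_content)
# Combine the two fitted vectorizers; their output columns are concatenated
combined_features = FeatureUnion([('CountVectorizer', vecA), ('CountVect', vecB)])
combined_features.transform(test_data)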
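
The snippet above can be tried end to end with a small runnable sketch. The toy documents and the step names "type_a" and "type_b" are made up for illustration, and the sketch assumes that calling transform on a FeatureUnion whose vectorizers were fitted beforehand is acceptable in your scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpora standing in for the real type A and type B documents.
docs_a = ["alpha beta", "beta gamma gamma"]
docs_b = ["x1 y2", "y2 z3 z3"]

# Each vectorizer learns its vocabulary from its own document type only.
vec_a = CountVectorizer(token_pattern="[a-zA-Z]+").fit(docs_a)
vec_b = CountVectorizer(token_pattern="[a-zA-Z0-9]+").fit(docs_b)

# The union is not fitted itself (that would refit both vectorizers on the
# same data); transform() passes every document through both fitted
# vectorizers and stacks the resulting columns side by side.
union = FeatureUnion([("type_a", vec_a), ("type_b", vec_b)])
X = union.transform(docs_a + docs_b)

print(X.shape)      # 4 documents, 3 + 3 vocabulary columns
print(X.toarray())
```

Note that every document is run through both patterns, so a type A document only gets non-zero counts in the type B columns if one of its tokens happens to be in the type B vocabulary.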

You can read more about FeatureUnion, which is available from version 0.13.1, here: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
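
Since the goal in the question is Multinomial Naive Bayes classification, the combined sparse matrix can be passed straight to MultinomialNB. The following is only a sketch under stated assumptions: the training documents, the labels "A" and "B", and the step names are hypothetical, and the classifier uses its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion

# Hypothetical training data: two documents of each type.
docs_a = ["invoice total due", "invoice payment due"]
docs_b = ["error code e42", "error code e17 retry"]
train_docs = docs_a + docs_b
train_labels = ["A"] * len(docs_a) + ["B"] * len(docs_b)

# Fit each vectorizer on its own document type, then combine them.
vec_a = CountVectorizer(token_pattern="[a-zA-Z]+").fit(docs_a)
vec_b = CountVectorizer(token_pattern="[a-zA-Z0-9]+").fit(docs_b)
union = FeatureUnion([("type_a", vec_a), ("type_b", vec_b)])

# Train on the concatenated term counts of all training documents.
clf = MultinomialNB().fit(union.transform(train_docs), train_labels)

# New documents must go through the same union before prediction.
predictions = clf.predict(union.transform(["payment due", "error e42"]))
print(predictions)
```

The same union object has to be used for training and prediction so that the column layout of the feature matrix stays identical.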
