Merging bag-of-words scikits classifier with arbitrary numeric fields

Question

How would you merge a scikit-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?

I know that these are basically the same thing behind the scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:

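from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])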

Whereas my other usage is:

classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])

How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? For example:

classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])


Recommended answer

The easy way:

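import scipy.sparse
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Text features: hashed n-gram counts reweighted by tf-idf.
# (Newer scikit-learn releases spell non_negative=True as alternate_sign=False.)
tfidf = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)

# Numeric features as a 2-D array (or sparse matrix) with one row per text.
X_other = load_your_other_features()

# Stack the two blocks column-wise into a single feature matrix.
X = scipy.sparse.hstack([X_tfidf, X_other])

clf = LinearSVC().fit(X, y)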
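
To make the stacking step concrete, here is a minimal runnable sketch on toy data. The texts, numbers and labels are invented for illustration, and `alternate_sign=False` is used on the assumption of a recent scikit-learn release where `non_negative` is no longer available. The one thing to watch is that the numeric values must form a 2-D block with one row per text.

```python
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
texts = ["some random text", "some other text",
         "more random words", "other words here"]
numbers = np.array([1.23, 4.23, 0.75, 3.90])
y = ["CLS_A", "CLS_B", "CLS_A", "CLS_B"]

tfidf = Pipeline([
    # alternate_sign=False plays the role of the older non_negative=True.
    ("vectorizer", HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)  # sparse matrix, one row per text

# Reshape the 1-D numbers into a single column so the row counts match.
X_other = scipy.sparse.csr_matrix(numbers.reshape(-1, 1))

# Column-wise stack; ask for CSR, which the linear models handle efficiently.
X = scipy.sparse.hstack([X_tfidf, X_other], format="csr")

clf = LinearSVC().fit(X, y)
predictions = clf.predict(X)
```

Since tf-idf rows are normalized while the numeric column is not, it may be worth scaling the numeric block before stacking; whether that helps depends on the data.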

The principled solution, which lets you keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
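
One possible shape for that FeatureUnion approach is sketched below. It is not the answerer's code: it assumes each sample is a `(text, number)` tuple as in the question, uses `FunctionTransformer` with two hypothetical helper functions (`get_text`, `get_numbers`) as the "simple transformer objects", and assumes a recent scikit-learn release (`alternate_sign=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def get_text(rows):
    """Hypothetical helper: the text part of each (text, number) row."""
    return [text for text, _ in rows]


def get_numbers(rows):
    """Hypothetical helper: the numeric part as an (n_samples, 1) array."""
    return np.array([[number] for _, number in rows], dtype=float)


classifier = Pipeline([
    ("features", FeatureUnion([
        # Branch 1: pick out the text, hash n-grams, apply tf-idf.
        ("text", Pipeline([
            ("select", FunctionTransformer(get_text)),
            ("vectorizer", HashingVectorizer(ngram_range=(1, 4),
                                             alternate_sign=False)),
            ("tfidf", TfidfTransformer()),
        ])),
        # Branch 2: pick out the numeric field as a single column.
        ("numeric", FunctionTransformer(get_numbers)),
    ])),
    ("clf", LinearSVC()),
])

# Toy data, invented for illustration only.
rows = [("some random text", 1.23), ("some other text", 4.23),
        ("more random words", 0.75), ("other words here", 3.90)]
labels = ["CLS_A", "CLS_B", "CLS_A", "CLS_B"]

classifier.fit(rows, labels)
predictions = classifier.predict(rows)
```

The FeatureUnion concatenates the outputs of both branches column-wise, so the final LinearSVC sees the same kind of matrix as the manual `hstack` version, while fitting and prediction go through a single object.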

(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR (one-vs-rest) out of the box, so this is just a slower way of fitting an OvR SVM.)
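
A small check of that point, on invented three-class toy data: a bare LinearSVC already learns one weight vector per class, the same number of binary problems that the OneVsRestClassifier wrapper sets up explicitly.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Three well-separated toy classes, invented for illustration only.
X = np.array([[0.0, 0.1], [0.2, 0.0],
              [5.0, 5.1], [5.2, 4.9],
              [10.0, 0.1], [9.8, 0.0]])
y = ["CLS_A", "CLS_A", "CLS_B", "CLS_B", "CLS_C", "CLS_C"]

# Plain LinearSVC: multi_class defaults to "ovr" (one-vs-rest).
plain = LinearSVC().fit(X, y)

# Explicit wrapper: fits one separate LinearSVC per class.
wrapped = OneVsRestClassifier(LinearSVC()).fit(X, y)

n_weight_vectors = plain.coef_.shape[0]      # one row per class
n_wrapped_models = len(wrapped.estimators_)  # one estimator per class
```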
