Merging bag-of-words scikits classifier with arbitrary numeric fields

Views: 89 | Published: 2020/10/2 3:10:34 | Tags: python, classification, scikit-learn

Question

How would you merge a scikit-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?

I know that these are basically the same thing behind the scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:

classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])

Whereas my other usage is:

classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])

How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? For example:

classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])

Recommended answer

The easy way:

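import scipy.sparse
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Note: recent scikit-learn releases dropped non_negative=True;
# alternate_sign=False is the replacement there.
tfidf = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)

# Placeholder: should return a 2-D array with one row per text.
X_other = load_your_other_features()

# Concatenate the sparse tf-idf columns and the numeric columns side by side.
X = scipy.sparse.hstack([X_tfidf, X_other])

clf = LinearSVC().fit(X, y)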

The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
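As a rough illustration of the FeatureUnion idea, here is a minimal sketch. It assumes each sample is a (text, number) tuple, as in the fit() call from the question; the helper names get_text and get_numeric and the toy data are made up for this example, and FunctionTransformer is just one convenient way to build the "simple transformer objects". It also uses alternate_sign=False, which is what recent scikit-learn releases accept in place of non_negative=True.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def get_text(samples):
    """Hypothetical helper: pull the text out of each (text, number) sample."""
    return [text for text, _ in samples]


def get_numeric(samples):
    """Hypothetical helper: numeric part as a 2-D column, shape (n_samples, 1)."""
    return np.array([[number] for _, number in samples], dtype=float)


features = FeatureUnion([
    # Branch 1: text -> hashed n-gram counts -> tf-idf (sparse output).
    ('text', Pipeline([
        ('select', FunctionTransformer(get_text, validate=False)),
        ('vectorizer', HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
        ('tfidf', TfidfTransformer()),
    ])),
    # Branch 2: the numeric field passed through as a single dense column.
    ('numeric', FunctionTransformer(get_numeric, validate=False)),
])

classifier = Pipeline([
    ('features', features),  # FeatureUnion stacks both branches column-wise
    ('clf', LinearSVC()),
])

# Toy data, invented for illustration only.
samples = [
    ('some random text', 1.23),
    ('some other text', 4.23),
    ('more random words', 0.50),
    ('yet other text', 5.10),
]
labels = ['CLS_A', 'CLS_B', 'CLS_A', 'CLS_B']

classifier.fit(samples, labels)
predictions = classifier.predict(samples)
```

With this layout the whole thing can be fitted, cross-validated, and pickled as a single estimator, which is the main advantage over stacking the matrices by hand.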

(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)
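The point about one-vs-rest can be checked on a small example. The sketch below uses made-up three-class data and simply compares the structure of the two fitted models; it does not measure speed.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy three-class data, invented for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(60, 5)
y = np.repeat([0, 1, 2], 20)
X[y == 1] += 3.0
X[y == 2] -= 3.0

plain = LinearSVC().fit(X, y)
wrapped = OneVsRestClassifier(LinearSVC()).fit(X, y)

# LinearSVC alone already learns one weight vector per class ...
print(plain.multi_class, plain.coef_.shape)
# ... while the wrapper fits one separate binary LinearSVC per class.
print(len(wrapped.estimators_))
```

Both models end up with one linear decision function per class, so the wrapper adds overhead without changing the kind of model being fitted.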
