Merging bag-of-words scikits classifier with arbitrary numeric fields
Question
How would you merge a scikit-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind the scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])
And my other usage is:
classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? e.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])
Recommended answer
The easy way:
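import scipy.sparse
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

tfidf = Pipeline([
    # non_negative=True is from older scikit-learn; newer releases use alternate_sign=False
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()  # placeholder: 2-D array, one row per text
# Stack the sparse text features and the numeric features column-wise
X = scipy.sparse.hstack([X_tfidf, X_other])
clf = LinearSVC().fit(X, y)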
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
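As a rough illustration of the FeatureUnion idea, here is a minimal sketch. It assumes each sample is a (text, number) tuple as in the question. TextSelector and NumericSelector are hypothetical helper classes written for this example, not part of scikit-learn, and alternate_sign=False is used as the newer-release stand-in for non_negative=True. The real feature extraction would likely look different.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class TextSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: returns the text part of each (text, number) sample."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text for text, _ in X]


class NumericSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: returns the numeric part as an (n_samples, 1) array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[number] for _, number in X], dtype=float)


classifier = Pipeline([
    # FeatureUnion runs both branches on the same input and
    # concatenates their outputs column-wise.
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('select', TextSelector()),
            ('vectorizer', HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('numeric', NumericSelector()),
    ])),
    ('clf', LinearSVC()),
])

# Made-up toy data in the shape the question describes.
samples = [('some random text', 1.23), ('some other text', 4.23),
           ('more random words', 0.97), ('yet other words', 5.10)]
labels = ['CLS_A', 'CLS_B', 'CLS_A', 'CLS_B']

classifier.fit(samples, labels)
predictions = classifier.predict([('random text again', 1.10)])
```

The union does the same column-wise stacking as scipy.sparse.hstack in the easy way, but since it lives inside the Pipeline, the same object handles fit and predict on raw (text, number) samples.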
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)
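To see the OvR point concretely, the following sketch fits both variants on made-up three-class data (the cluster centers and sample counts are arbitrary assumptions). A plain LinearSVC already learns one weight vector per class, which is the same structure the wrapper builds from separate estimators.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Synthetic data: three well-separated 2-D clusters, 20 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([center + rng.randn(20, 2) for center in centers])
y = np.repeat([0, 1, 2], 20)

plain = LinearSVC().fit(X, y)
wrapped = OneVsRestClassifier(LinearSVC()).fit(X, y)

# One row of weights per class in the plain model...
print(plain.coef_.shape)
# ...and one binary LinearSVC per class inside the wrapper.
print(len(wrapped.estimators_))
```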