How do I properly combine numerical features with text (bag of words) in scikit-learn?


Problem description

I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up being like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
  [1, 0],
  [1, 1],
  [0, 0],
  [0, 1]
]
corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]
bag_of_words_vectorizer = CountVectorizer(min_df=1)
X = bag_of_words_vectorizer.fit_transform(corpus)
words_counts = X.toarray()
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(words_counts)

feature_names = bag_of_words_vectorizer.get_feature_names_out()  # vocabulary terms, one per tf-idf column
combined_features = np.hstack([numerical_features, tfidf.toarray()])

This works, but I'm concerned about the accuracy. Notice that there are 4 objects, and only two numerical features. Even the simplest text results in a vector with nine features (because there are nine distinct words in the corpus). Obviously, with real text there will be hundreds or thousands of distinct words, so the final feature vector would consist of < 10 numerical features but > 1000 word-based ones.

Because of this, won't the classifier (an SVM) weight the words far more heavily than the numerical features, by a factor of roughly 100 to 1? If so, how can I compensate to make sure the bag of words is weighted evenly against the numerical features?

Recommended answer

I think your concern about the significantly higher dimensionality produced by encoding sparse text tokens naively (as multi-hot vectors) is entirely valid. You could tackle that with at least the two approaches below. Both of them produce a low-dimensional vector (for example, 100 dimensions) from the text, and the dimensionality does not grow as your vocabulary grows.

  • with feature hashing. This applies directly to your bag-of-words model.
  • with word embeddings (an example usage that works with scikit-learn), or more advanced text encoders such as the Universal Sentence Encoder or any variant of a state-of-the-art BERT encoder.
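To illustrate the first approach, here is a minimal sketch using scikit-learn's `HashingVectorizer` on the corpus from the question. The choice of `n_features=100` is an assumption for illustration; in practice you would tune it. The hashed text vector stays at 100 dimensions no matter how many distinct words appear, so the numerical features are no longer outnumbered a thousand to one.

```python
from sklearn.feature_extraction.text import HashingVectorizer
import numpy as np

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
numerical_features = np.array([
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1],
])

# Hash each document into a fixed 100-dimensional vector; the width of
# this representation does not change when the vocabulary grows.
hashing_vectorizer = HashingVectorizer(n_features=100)
text_features = hashing_vectorizer.fit_transform(corpus).toarray()

# Combine with the numerical features as before.
combined = np.hstack([numerical_features, text_features])
print(combined.shape)  # (4, 102)
```

Note that hashing is stateless (there is no learned vocabulary), so the same transformer can be applied to unseen documents without refitting; the trade-off is that you can no longer map columns back to individual words.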
