Python vectorization for classification


Problem description

I am building a text classification model (document classification) with roughly 80 classes. When I vectorize the text into a TF-IDF matrix and train a random forest on it, the model works well. However, when I introduce new data, its words are not necessarily the same as those in the training set. This is a problem because the training set and the new data end up with different numbers of features (421 columns for training versus 619 for the new data).

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

# `data`, `labels`, and `new_X` are defined elsewhere in the script.

# Convert the training documents to a TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
print(tfidf_matrix.shape)
# number of features = 421

# Train random forest model
# (min_samples_split must be at least 2 in current scikit-learn)
clf = RandomForestClassifier(max_depth=None, min_samples_split=2, random_state=1, n_jobs=-1)

# k-fold cross-validation
scores = cross_val_score(clf, tfidf_matrix.toarray(), labels, cv=7, n_jobs=-1)
print(scores.mean())

# New data matrix for unseen data
new_tfidf = tfidf_vectorizer.fit_transform(new_X)
# number of features = 619 (the training matrix had 421)

clf.fit(tfidf_matrix.toarray(), labels)
clf.predict(new_tfidf.toarray())

How can I build a working RF classification model that can handle new features (words) that were not seen during training?

Recommended answer

Do not call fit_transform on the unseen data; call only transform. That keeps the vocabulary learned from the training set, so the new matrix has the same columns the model was trained on. Words that never appeared in the training set are simply ignored.
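As a minimal sketch of that advice, the example below fits the vectorizer once on the training documents and then only transforms the new ones. The toy documents, labels, and variable names (`train_docs`, `new_docs`, and so on) are invented for illustration and are not from the original question; `n_estimators=50` is likewise an arbitrary choice.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, made up for this sketch only.
train_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "the market rallied after the report",
]
train_labels = ["pets", "pets", "finance", "finance"]
new_docs = ["a kitten sat on the sofa", "bond markets fell again"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary from the training documents.
X_train = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; words unseen in training are dropped.
X_new = vectorizer.transform(new_docs)

# Both matrices now have one column per training-vocabulary term.
print(X_train.shape, X_new.shape)

clf = RandomForestClassifier(n_estimators=50, random_state=1)
clf.fit(X_train, train_labels)
predictions = clf.predict(X_new)
print(predictions)
```

Because both matrices share the training vocabulary, their column counts match and predict no longer complains about the feature count. If new words need to influence predictions, the usual route is to refit the vectorizer and retrain the model on data that includes them.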
