How to standardize the bag of words for train and test?

Views: 67 | Published: 2020/5/18 1:10:01 | Tags: nlp, nltk


Question

I am trying to build a text classifier based on the bag-of-words model from NLP.

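I preprocessed the training data with NLTK (punctuation and stop-word removal, lowercasing, stemming, etc.).
I created a tf-idf matrix for the training data.
I preprocessed the test data the same way.
I created a tf-idf matrix for the test data.
The training and test data produce different bags of words, so the number of features differs and a classification algorithm such as kNN cannot be applied.
I merged the training and test data and created a single tf-idf matrix. That solved the problem of the differing bags of words, but the resulting matrix was too large to process.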

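Here are my questions:

Is there a way to create exactly the same bag of words for the training and test data?
If not, and my approach of merging the training and test data is correct, should I use a dimensionality reduction algorithm such as LDA?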

Recommended answer

You can use scikit-learn's CountVectorizer to turn the words in your documents into vectors, train a classifier of your choice on those vectors, and then use the same vectorizer and classifier on your test data.
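Since the question uses tf-idf rather than raw counts, the same idea can be sketched with TfidfVectorizer: fit the vectorizer on the training documents only, then call transform on the test documents so that both matrices share one vocabulary. The documents below are made up for illustration, and this is a minimal sketch rather than the answerer's own code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for this illustration only
train_docs = ["the church held a service", "the team won the match"]
test_docs = ["a new church opened", "the match was cancelled today"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and idf weights from the training set
X_train = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words unseen in training are ignored
X_test = vectorizer.transform(test_docs)

# Both matrices have the same number of columns (one per training word)
print(X_train.shape, X_test.shape)
```

Because the test set is only ever passed to transform, there is no need to merge the training and test data to get matching features.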

For the training set, fit the vectorizer on the training words as follows:

 import pandas as pd
 from sklearn.feature_extraction.text import CountVectorizer

 # One row per training word and its label
 LabeledWords = pd.DataFrame([{'word': 'Church', 'label': 'Religion'}])

 vectorizer = CountVectorizer()

 # Fit the vocabulary on the training words only; labels stay as plain values
 Xtrain = vectorizer.fit_transform(LabeledWords['word']).toarray()
 yTrain = LabeledWords['label']

You can then train a classifier of your choice on the vectorized training data, for example:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
clf = forest.fit(Xtrain, yTrain)

To test your data:

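# Preprocessed_list is assumed to hold (word, label) pairs
test_words = [each_word for each_word, label in Preprocessed_list]
test_labels = [label for each_word, label in Preprocessed_list]

# transform (not fit_transform) reuses the vocabulary learned from the training set
test_featuresX = vectorizer.transform(test_words).toarray()
print(clf.score(test_featuresX, test_labels))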
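The question also mentions that the combined matrix was too large to process. One possible way to keep memory use down, sketched here under assumptions, is to leave the vectorizer output as a SciPy sparse matrix instead of calling toarray(), and optionally to cap the vocabulary with max_features. The toy documents, the labels, and the cap of 5000 are all illustrative choices, not values from the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for this illustration only
train_docs = ["the church held a service", "the team won the match"]
train_labels = ["Religion", "Sports"]
test_docs = ["a new church opened", "the match was cancelled today"]

# max_features keeps only the most frequent terms; 5000 is an arbitrary cap
vectorizer = TfidfVectorizer(max_features=5000)

# The result stays a sparse matrix, which stores only the non-zero entries
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# kNN (the classifier named in the question) accepts sparse input directly
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)
predictions = knn.predict(X_test)
print(predictions)
```

Whether a vocabulary cap hurts accuracy depends on the data, so it is worth comparing scores with and without it.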
