Scikit learn - fit_transform on the test set

Question

I am struggling to use Random Forest in Python with Scikit learn. My problem is that I use it for text classification (in 3 classes - positive/negative/neutral) and the features that I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators = 100)
trainFeatures1 = vec.fit_transform(trainFeatures)

# Fit the training data to the training output and create the decision trees
rf = rf.fit(trainFeatures1.toarray(), LabelEncoder().fit_transform(trainLabels))

testFeatures1 = vec.fit_transform(testFeatures)
# Take the same decision trees and run on the test data
Output = rf.score(testFeatures1.toarray(), LabelEncoder().fit_transform(testLabels))

print "accuracy: " + str(Output)

My problem is that fit_transform works on the training set, which contains around 8,000 instances, but when I try to convert my test set (around 80,000 instances) to numerical features as well, I get a memory error:

    testFeatures1 = vec.fit_transform(testFeatures)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform
    return self.transform(X)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform
    Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
MemoryError

What could possibly cause this and is there any workaround? Many thanks!

Recommended answer

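You are not supposed to call fit_transform on your test data, only transform. fit_transform relearns the feature vocabulary from whatever data it is given, so calling it on the test set produces a different vectorization from the one the classifier was trained on.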
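As a minimal sketch of the difference, assuming tiny made-up feature dicts in place of the real trainFeatures and testFeatures: the vectorizer fitted on the training data keeps its three columns when it transforms the test data, while a second fit_transform on the test data yields a different column layout.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up feature dicts standing in for trainFeatures / testFeatures.
train_features = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1}]
test_features = [{"good": 1, "plot": 1}]  # "plot" never appears in training

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_features)  # learns the columns: bad, good, movie
X_test = vec.transform(test_features)        # reuses those columns; "plot" is ignored

print(vec.feature_names_)  # ['bad', 'good', 'movie']
print(X_test)              # [[0. 1. 0.]]

# What goes wrong: refitting on the test set learns a different vocabulary,
# so the columns no longer line up with what the model was trained on.
wrong_vec = DictVectorizer(sparse=False)
X_wrong = wrong_vec.fit_transform(test_features)
print(wrong_vec.feature_names_)  # ['good', 'plot']
```

Features that were not seen during fit are silently dropped by transform, which is what keeps the test matrix the same width as the training matrix.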

For the memory issue, I recommend TfidfVectorizer, which has several options for reducing the dimensionality (for example, min_df drops rare unigrams and max_features caps the vocabulary size).
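A rough sketch of that suggestion, assuming the raw texts are still available (TfidfVectorizer takes strings rather than feature dicts) and using made-up sentences and an arbitrary threshold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up raw texts; the real data would be the original documents.
train_texts = ["good movie", "bad movie", "good plot", "bad acting"]
test_texts = ["good acting", "terrible movie"]

# min_df=2 keeps only unigrams that occur in at least 2 training documents.
# The value 2 is an arbitrary choice for this toy example.
tfidf = TfidfVectorizer(min_df=2)
X_train_tfidf = tfidf.fit_transform(train_texts)  # sparse matrix, fitted on training text
X_test_tfidf = tfidf.transform(test_texts)        # same columns, no refitting

print(sorted(tfidf.vocabulary_))                # ['bad', 'good', 'movie']
print(X_train_tfidf.shape, X_test_tfidf.shape)  # (4, 3) (2, 3)
```

Here "plot" and "acting" each occur in only one training document, so they are dropped and the matrices have three columns instead of five.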

Update

If the only problem is fitting the transformed test data in memory, simply split it into small chunks. Instead of something like

x=vect.transform(test)
eval(x)

you can do

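K = 10
size = (len(test) + K - 1) // K  # chunk size, rounded up so no rows are dropped
for start in range(0, len(test), size):
    x = vect.transform(test[start : start + size])
    eval(x)  # placeholder for whatever evaluation you run on each chunk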

and record results/stats and analyze them afterwards.

In particular:

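from sklearn.metrics import accuracy_score

predictions = []

K = 10
size = (len(test) + K - 1) // K  # chunk size, rounded up so no rows are dropped
for start in range(0, len(test), size):
    x = vect.transform(test[start : start + size])
    predictions.extend(rf.predict(x))  # predict returns an array of labels; extend adds them one by one

print(accuracy_score(true_labels, predictions))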
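The chunked evaluation above relies on names defined elsewhere (vect, rf, test, true_labels). A self-contained sketch on made-up toy data might look like the following; predict_in_chunks is a hypothetical helper introduced only for this illustration, and the chunk size of 2 is arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

# Made-up toy data standing in for the real training and test sets.
train = [{"good": 1}, {"bad": 1}, {"good": 1, "film": 1}, {"bad": 1, "film": 1}]
train_labels = [1, 0, 1, 0]
test = [{"good": 1}, {"bad": 1}, {"bad": 1, "film": 1}, {"good": 1, "plot": 1}, {"bad": 1}]
true_labels = [1, 0, 0, 1, 0]

vect = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(vect.fit_transform(train), train_labels)  # the vectorizer is fitted once, on training data

def predict_in_chunks(model, vectorizer, rows, chunk_size):
    """Hypothetical helper: vectorize and predict chunk_size rows at a time."""
    predictions = []
    for start in range(0, len(rows), chunk_size):
        x = vectorizer.transform(rows[start:start + chunk_size])  # only this chunk is dense in memory
        predictions.extend(model.predict(x))
    return predictions

chunked = predict_in_chunks(rf, vect, test, chunk_size=2)
print(accuracy_score(true_labels, chunked))
```

Because the fitted vectorizer and model are reused unchanged for every chunk, the chunked predictions match what a single call on the whole test set would return; only the peak size of the dense matrix changes.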
