Memory error when calling todense() in Python using CountVectorizer

Question

Here is my code and the memory error I get when calling todense(). I am using a GBDT model and am wondering whether anyone has good ideas for working around the memory error. Thanks.

  for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
  y_train = y_train.astype('int')
  grd = GradientBoostingClassifier(n_estimators=n_estimator, max_depth=10)
  grd.fit(X_train.values, y_train.values)

Detailed error message:

in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
...

Regards, Lin

Recommended answer

There are several problems here:

for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()

1) You are trying to assign multiple columns (the result of CountVectorizer is a 2-D matrix whose columns represent features) to a single DataFrame column, 'feature_colunm_name'. That is not going to work and will produce an error.

2) You are fitting the CountVectorizer again on the test data, which is wrong. You should use the same CountVectorizer object on the test data that you fitted on the training data, and only call transform() on it, not fit_transform().

Something like:

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
X_test_cv = cv.transform(X_test[feature_colunm_name])
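
To illustrate why refitting on the test data is a problem, here is a small sketch on made-up toy documents (the strings and variable names are assumptions for illustration, not from the question). A refitted vectorizer learns a different vocabulary, so its columns no longer mean the same thing as the training columns, even when the column count happens to match.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for one text column
train_docs = ["red apple", "green apple", "red grape"]
test_docs = ["green grape", "yellow banana"]

# Wrong: a second vectorizer fitted on the test data learns its own vocabulary
refit = CountVectorizer().fit(test_docs)

# Right: fit once on the training data, then reuse the same object
cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_docs)
X_test_cv = cv.transform(test_docs)

train_vocab = sorted(cv.vocabulary_)
refit_vocab = sorted(refit.vocabulary_)

print(train_vocab)      # vocabulary the model is trained on
print(refit_vocab)      # different words, so columns would be misaligned
print(X_test_cv.shape)  # same number of columns as X_train_cv
```

Words in the test data that were never seen during fitting are simply ignored by transform(), which is what keeps the test matrix aligned with the training matrix.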

3) GradientBoostingClassifier works well with sparse data, so there is no need to call todense() in the first place. This is not mentioned in the documentation yet (which seems to be an omission in the documentation).

4) You seem to be transforming multiple columns of your original data to bag-of-words form. For that you will need one CountVectorizer object per column, and then you need to merge all the output matrices into a single matrix, which you pass to GradientBoostingClassifier.
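
On point 3, a rough back-of-the-envelope sketch shows why todense() runs out of memory while the sparse matrix does not. The row count, vocabulary size, and tokens per row below are hypothetical numbers chosen for illustration, and the helper names are mine. The 8-byte item size assumes CountVectorizer's default int64 counts, and the 4-byte index size assumes 32-bit CSR indices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a full n_rows x n_cols array (what todense() allocates)."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_size=4):
    """Approximate bytes for a CSR matrix: values, column indices, row pointers."""
    return nnz * itemsize + nnz * index_size + (n_rows + 1) * index_size


# Hypothetical sizes, not taken from the question
n_rows, n_cols = 100_000, 50_000
avg_tokens_per_row = 20
nnz = n_rows * avg_tokens_per_row

dense_gb = dense_bytes(n_rows, n_cols) / 1e9
sparse_mb = csr_bytes(n_rows, nnz) / 1e6

print(f"dense:  {dense_gb:.1f} GB")
print(f"sparse: {sparse_mb:.1f} MB")
```

The dense size grows with rows times vocabulary size, while the sparse size grows only with the number of non-zero counts, which is why keeping the matrix sparse avoids the np.zeros allocation seen in the traceback.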

Update:

You need a setup like the following:

# To merge sparse matrices
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

result_matrix_train = None
result_matrix_test = None

for feature_colunm_name in feature_columns_to_use:
    # One vectorizer per text column, fitted on the training data only
    cv = CountVectorizer()
    X_train_cv = cv.fit_transform(X_train[feature_colunm_name])

    # Merge the vector with others
    result_matrix_train = (hstack((result_matrix_train, X_train_cv))
                           if result_matrix_train is not None else X_train_cv)

    # Now transform the test data
    X_test_cv = cv.transform(X_test[feature_colunm_name])
    result_matrix_test = (hstack((result_matrix_test, X_test_cv))
                          if result_matrix_test is not None else X_test_cv)

Note: If you also have other columns that you did not process through CountVectorizer because they are already numerical, and you want to merge them with result_matrix_train, you can do that too:

result_matrix_train = hstack((result_matrix_train, X_train[other_columns].values))
result_matrix_test = hstack((result_matrix_test, X_test[other_columns].values))
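
A minimal sketch of that merge on toy data follows. The arrays are made up for illustration. Converting the numeric block with csr_matrix first and passing format="csr" are my own choices rather than part of the answer. They keep every block sparse and make the result format explicit.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical bag-of-words counts for 2 rows and 3 vocabulary terms
bow = csr_matrix(np.array([[1, 0, 2],
                           [0, 1, 0]]))

# Hypothetical numeric columns for the same 2 rows
numeric = np.array([[0.5, 10.0],
                    [1.5, 20.0]])

# Convert the dense block to sparse, then stack column-wise
merged = hstack((bow, csr_matrix(numeric)), format="csr")

print(merged.shape)
print(merged.toarray())
```

Both blocks must have the same number of rows, in the same row order, for the stacked features to line up with the labels.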

Now use these to train:

...
grd.fit(result_matrix_train, y_train.values)
