MemoryError in toarray when using DictVectorizer of Scikit Learn
Problem Description
I am trying to apply the SelectKBest algorithm to my data to get the best features out of it. For this I first preprocess my data with DictVectorizer; the data consists of 1,061,427 rows with 15 features. Each feature has many distinct values, and I believe I am getting a memory error due to the high cardinality.
I get the following error:
File "FeatureExtraction.py", line 30, in <module>
quote_data = DV.fit_transform(quote_data).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py", line 563, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/coo.py", line 233, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
Is there an alternate way I could do this? Why do I get a memory error when I am processing on a machine that has 256 GB of RAM?
Any help is appreciated!
Recommended Answer
When performing fit_transform, instead of passing the whole list of dictionaries to it, create dictionaries that contain only unique occurrences. Here is an example:
Converting the dictionaries:
Before
[ {A:1,B:22.1,C:Red,D:AB12},
{A:2,B:23.3,C:Blue,D:AB12},
{A:3,B:20.2,C:Green,D:AB65},
]
After
[ {A:1,B:22.1,C:Red,D:AB12},
{C:Blue},
{C:Green,D:AB65},
]
This saves a lot of space.
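The reduction shown above can be sketched as a small helper. This is a minimal sketch, not code from the original answer: the function name unique_occurrences and the numeric-versus-string distinction are assumptions, chosen to reproduce the Before/After example (DictVectorizer maps a numeric feature to a single column, so only the first occurrence of a numeric key matters, while each distinct string value gets its own column):

```python
def unique_occurrences(records):
    """Keep only the first occurrence of each feature=value combination.

    Numeric features are deduplicated by key alone (they share one column
    in DictVectorizer); string features are deduplicated by (key, value),
    since each distinct value becomes its own column.
    """
    seen = set()
    reduced_records = []
    for record in records:
        reduced = {}
        for key, value in record.items():
            # Numeric values: the key identifies the column.
            # String values: the (key, value) pair identifies the column.
            token = key if isinstance(value, (int, float)) else (key, value)
            if token not in seen:
                seen.add(token)
                reduced[key] = value
        reduced_records.append(reduced)
    return reduced_records

before = [
    {"A": 1, "B": 22.1, "C": "Red", "D": "AB12"},
    {"A": 2, "B": 23.3, "C": "Blue", "D": "AB12"},
    {"A": 3, "B": 20.2, "C": "Green", "D": "AB65"},
]
after = unique_occurrences(before)
# after == [{"A": 1, "B": 22.1, "C": "Red", "D": "AB12"},
#           {"C": "Blue"},
#           {"C": "Green", "D": "AB65"}]
```

A reasonable way to use this (an assumption, not stated in the answer) is to fit DictVectorizer on the reduced list to build its vocabulary, then transform the full data while keeping the result sparse rather than calling toarray on it.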