MemoryError in toarray when using DictVectorizer of Scikit Learn
Problem Description
I am trying to apply the SelectKBest algorithm to my data to get the best features out of it. For this I first preprocess my data with DictVectorizer; the data consists of 1,061,427 rows with 15 features. Each feature has many distinct values, and I believe I am getting a memory error due to the high cardinality.
I get the following error:
File "FeatureExtraction.py", line 30, in <module>
quote_data = DV.fit_transform(quote_data).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py", line 563, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/coo.py", line 233, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
Is there an alternate way I could do this? Why do I get a memory error when I am processing on a machine that has 256 GB of RAM?
Any help is appreciated!
Recommended Answer
When performing fit_transform, instead of passing the whole list of dictionaries to it, create dictionaries that contain only unique occurrences. Here is an example:
Converting the dictionaries:
Before
[ {A:1,B:22.1,C:Red,D:AB12},
{A:2,B:23.3,C:Blue,D:AB12},
{A:3,B:20.2,C:Green,D:AB65},
]
After
[ {A:1,B:22.1,C:Red,D:AB12},
{C:Blue},
{C:Green,D:AB65},
]
This saves a lot of space.
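The reduction shown above can be sketched as a small helper. This is a minimal sketch, not code from the original answer: the function name unique_occurrences and the numeric-versus-string distinction are assumptions, chosen to reproduce the Before/After example (DictVectorizer maps a numeric feature to a single column, so only the first occurrence of a numeric key matters, while each distinct string value gets its own column):

```python
def unique_occurrences(records):
    """Keep only the first occurrence of each feature=value combination.

    Numeric features are deduplicated by key alone (they share one column
    in DictVectorizer); string features are deduplicated by (key, value),
    since each distinct value becomes its own column.
    """
    seen = set()
    reduced_records = []
    for record in records:
        reduced = {}
        for key, value in record.items():
            # Numeric values: the key identifies the column.
            # String values: the (key, value) pair identifies the column.
            token = key if isinstance(value, (int, float)) else (key, value)
            if token not in seen:
                seen.add(token)
                reduced[key] = value
        reduced_records.append(reduced)
    return reduced_records

before = [
    {"A": 1, "B": 22.1, "C": "Red", "D": "AB12"},
    {"A": 2, "B": 23.3, "C": "Blue", "D": "AB12"},
    {"A": 3, "B": 20.2, "C": "Green", "D": "AB65"},
]
after = unique_occurrences(before)
# after == [{"A": 1, "B": 22.1, "C": "Red", "D": "AB12"},
#           {"C": "Blue"},
#           {"C": "Green", "D": "AB65"}]
```

A reasonable way to use this (an assumption, not stated in the answer) is to fit DictVectorizer on the reduced list to build its vocabulary, then transform the full data while keeping the result sparse rather than calling toarray on it.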