Label Propagation - Array is too big


Problem Description

I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions. I am unable to use it on this data set: it throws a numpy "array is too big" error. However, it works fine on a relatively small data set, say 200 points. Can anyone suggest a fix?

label_prop_model.fit(np.array(data), labels)
  File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
    graph_matrix = self._build_graph()
  File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 108, in _build_graph
    affinity_matrix = self._get_kernel(self.X_)  # get the affinity matrix from the data using rbf kernel
  File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 26, in _get_kernel
    return rbf_kernel(X, X, gamma=self.gamma)
  File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 350, in rbf_kernel
    K = euclidean_distances(X, Y, squared=True)
  File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 173, in euclidean_distances
    distances = safe_sparse_dot(X, Y.T, dense_output=True)
  File "/usr/lib/pymodules/python2.7/sklearn/utils/extmath.py", line 79, in safe_sparse_dot
    return np.dot(a, b)
ValueError: array is too big.
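For reference, the setup can be reproduced along these lines (the data here is synthetic stand-ins for the question's data and labels; the failure appears once the dense 17000 x 17000 kernel matrix cannot be allocated, e.g. on 32-bit builds where 2+ GB cannot be addressed):

import numpy as np
from sklearn.semi_supervised import LabelPropagation

data = np.random.rand(17000, 7)          # 17,000 points, 7 dimensions
labels = np.full(17000, -1, dtype=int)   # -1 marks unlabeled points
labels[:50] = 0                          # pretend a few points are labeled
labels[50:100] = 1

label_prop_model = LabelPropagation()
# Fails as in the traceback above when the dense 17000 x 17000
# kernel matrix cannot be allocated.
label_prop_model.fit(np.array(data), labels)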


Recommended Answer

How much memory does your computer have?

What sklearn might be doing here (I haven't gone through the source, so I might be wrong) is computing the Euclidean distance between every pair of data points in the 17000 x K data matrix. This yields squared Euclidean distances for all data points, but unfortunately produces an N x N output matrix if you have N data points. As far as I know, numpy uses double precision by default, which results in a 17000 x 17000 x 8 byte matrix, approximately 2.15 GB.
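A quick back-of-the-envelope check of that figure (assuming 8 bytes per float64 entry):

n = 17000
print(n * n * 8 / 2.0**30)  # prints roughly 2.15, i.e. ~2.15 GiB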

If your memory can't hold a matrix of that size, that would cause trouble. Try creating a matrix of this size with numpy:

import numpy
mat = numpy.ones((17000, 17000))  # note: the shape must be passed as a tuple

If this succeeds, I'm mistaken and the problem is something else (though certainly related to memory size and the matrices sklearn is trying to allocate).

Off the top of my head, one way to resolve this might be to propagate labels in parts by subsampling the unlabeled data points (and possibly the labeled points, if you have many of them). If you are able to run the algorithm for 17000/2 data points and you have L labeled points, build your new data set by randomly drawing (17000-L)/2 of the unlabeled points from the original set and combining them with the L labeled points. Run the algorithm for each partition of the full set.
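A minimal sketch of that partitioning idea, assuming -1 marks unlabeled points (the convention LabelPropagation expects) and that half of the set fits in memory; the synthetic X and labels below are illustrative stand-ins for the question's data:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in: 17,000 points, 7 dimensions, 100 labeled points.
rng = np.random.RandomState(0)
X = rng.rand(17000, 7)
labels = np.full(17000, -1, dtype=int)
labels[:100] = rng.randint(0, 2, size=100)

labeled_idx = np.flatnonzero(labels != -1)
unlabeled_idx = np.flatnonzero(labels == -1)

# Split the unlabeled points into two random halves and propagate labels
# within each half together with all of the labeled points.
predicted = labels.copy()
for part in np.array_split(rng.permutation(unlabeled_idx), 2):
    subset = np.concatenate([labeled_idx, part])
    model = LabelPropagation()
    model.fit(X[subset], labels[subset])
    # transduction_ holds the inferred label of every point in the subset;
    # the first len(labeled_idx) entries correspond to the labeled points.
    predicted[part] = model.transduction_[len(labeled_idx):]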

Note that this will probably reduce the performance of the label propagation algorithm, since it will have fewer data points to work with. Inconsistencies between the labels assigned in each of the subsets might also cause trouble. Use with extreme caution, and only if you have some way to evaluate the performance :)

A safer approach would be to A: get more memory, or B: use a label propagation algorithm that is less memory-intensive. It is certainly possible to trade time complexity for memory complexity by recalculating Euclidean distances when needed, rather than constructing a full all-pairs distance matrix as scikit appears to be doing here.
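For option B, scikit-learn's own LabelPropagation accepts a knn kernel, which builds a sparse k-nearest-neighbors graph instead of the dense N x N RBF affinity matrix, so memory grows roughly as N*k rather than N*N. A hedged sketch (availability depends on the sklearn version, and n_neighbors=7 is just an illustrative value):

from sklearn.semi_supervised import LabelPropagation

# knn kernel: only k graph entries are kept per point.
model = LabelPropagation(kernel='knn', n_neighbors=7)
model.fit(X, labels)  # X and labels as in the sketch above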
