Python: computing pairwise distances causes memory error


Question

I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.

from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist returns a condensed 1-D NumPy array of shape (57832 * 57831 // 2,).

However, this causes a memory error.

Traceback (most recent call last):
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
    result_clustering = clf_clustering.getCVResult(shuffle)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
    self.centroids_of_categories(X_train, y_train)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
    cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
    distance_dict, rho_dict = self.compute_distances_and_rhos(X)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
    pairwise_distances = pdist(X, 'cosine')
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError

My laptop has 16 GB of RAM. How can I fix this? Or is there a better approach?

Recommended answer

Running matrix-based algorithms on large data sets is prohibitively expensive.

The memory requirements are straightforward to estimate. Even when exploiting symmetry, many implementations max out at about 65,000 instances. But even 64-bit implementations on big machines will eventually give up: a 1,000,000 x 1,000,000 matrix in double precision, stored exploiting symmetry, needs about 4 TB of RAM.
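As a rough check of that estimate, the sketch below computes the size of the condensed (symmetry-exploiting) distance array for a given number of vectors. The helper name `condensed_size_bytes` is made up for illustration; it only assumes one 8-byte double per unordered pair, which matches the `np.zeros((m * (m - 1)) // 2, dtype=np.double)` line in the traceback.

```python
import numpy as np


def condensed_size_bytes(n, dtype=np.float64):
    """Bytes needed to store one distance per unordered pair of n vectors."""
    n_pairs = n * (n - 1) // 2  # size of the condensed array
    return n_pairs * np.dtype(dtype).itemsize


for n in (57832, 10**6):
    gigabytes = condensed_size_bytes(n) / 1e9
    print(f"n = {n}: about {gigabytes:.1f} GB")
```

For the 57832 vectors in the question this comes to roughly 13.4 GB for the result alone, which leaves very little of a 16 GB machine for the input data, the interpreter, and everything else.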

Use better algorithms that don't need O(n^2) memory and runtime.
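One possible way to avoid the O(n^2) memory, sketched here under assumptions, is to compute distances one block of rows at a time and immediately reduce each block to the per-point quantity that is actually needed. The example counts neighbours within a cosine-distance cutoff; the function name, `cutoff`, and `chunk_size` are hypothetical and not taken from the original code. Note that this only reduces memory to O(chunk_size * n); the runtime is still quadratic.

```python
import numpy as np
from scipy.spatial.distance import cdist


def neighbor_counts_chunked(X, cutoff, chunk_size=1024):
    """For each row of X, count the other rows within `cutoff` cosine distance.

    Only a (chunk_size, n) block of distances is held in memory at a time.
    """
    n = X.shape[0]
    counts = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Distances from this block of rows to every row: shape (stop - start, n)
        block = cdist(X[start:stop], X, metric="cosine")
        # Exclude each point's distance to itself from the count
        block[np.arange(stop - start), np.arange(start, stop)] = np.inf
        counts[start:stop] = (block < cutoff).sum(axis=1)
    return counts
```

With `chunk_size=1024` and 57832 vectors, each block holds about 59 million doubles (under 0.5 GB), so the peak memory stays far below the full condensed matrix. Whether this reduction fits depends on what the downstream algorithm needs from the distances.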
