Python: computing pairwise distances causes memory error
Question
I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.
from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist returns a condensed 1-D numpy array with shape (57832*57831/2,).
However, this causes a memory error.
Traceback (most recent call last):
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
    result_clustering = clf_clustering.getCVResult(shuffle)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
    self.centroids_of_categories(X_train, y_train)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
    cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
    distance_dict, rho_dict = self.compute_distances_and_rhos(X)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
    pairwise_distances = pdist(X, 'cosine')
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
My laptop has 16 GB of RAM. How can I fix this? Or is there a better approach?
Recommended answer
Running matrix-based algorithms on large data sets is prohibitively expensive.
The memory requirements are straightforward to estimate. Even when exploiting symmetry, many implementations max out at about 65000 instances, which is roughly where the number of pairs overflows a signed 32-bit index. But even 64-bit implementations on big machines will eventually give up: a 1000000x1000000 matrix in double precision needs 4 TB of RAM, even with symmetry exploited.
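As a rough illustration of that estimate, the sketch below computes the size of the condensed array that pdist tries to allocate (the `(m * (m - 1)) // 2` doubles visible in the traceback). The function name `condensed_pdist_bytes` is made up for this example, and the figure covers only the distance array itself, not the input data or any temporaries.

```python
def condensed_pdist_bytes(n_vectors, bytes_per_value=8):
    """Bytes needed for the condensed distance array of n_vectors points.

    Hypothetical helper for illustration; 8 bytes per value corresponds
    to double precision (np.double).
    """
    n_pairs = n_vectors * (n_vectors - 1) // 2  # symmetry already exploited
    return n_pairs * bytes_per_value


for n in (57832, 1000000):
    size = condensed_pdist_bytes(n)
    print("%d vectors -> %.1f GB" % (n, size / 1e9))
```

For the 57832 vectors in the question this comes to about 13.4 GB, which leaves very little headroom on a 16 GB laptop once the operating system and the rest of the program are accounted for.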
Use better algorithms that don't need O(n^2) memory and runtime.
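One way to at least avoid the O(n^2) memory is to compute distances block by block and keep only a per-point summary. The sketch below is not the asker's method: it assumes the downstream code only needs, for each vector, the number of other vectors within some cosine-distance cutoff (the `compute_distances_and_rhos` name in the traceback hints at a density count, but the source does not show it). The function name, `cutoff`, and `block_size` are hypothetical, and the rows of `X` are assumed to be non-zero.

```python
import numpy as np


def neighbor_counts_blockwise(X, cutoff, block_size=1024):
    """Count, for each row of X, the other rows within `cutoff` cosine distance.

    Only one (block_size x n) slab of distances is held in memory at a time.
    Assumes no row of X is all zeros.
    """
    X = np.asarray(X, dtype=np.float64)
    # Normalise rows so that cosine distance becomes 1 - dot product.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = unit.shape[0]
    counts = np.zeros(n, dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Distances from rows [start, stop) to every row: shape (stop - start, n).
        block = 1.0 - unit[start:stop].dot(unit.T)
        within = block < cutoff
        # Do not count a point as its own neighbour.
        within[np.arange(stop - start), np.arange(start, stop)] = False
        counts[start:stop] = within.sum(axis=1)
    return counts
```

With 57832 vectors and the default block size, each slab of distances is roughly 470 MB instead of the 13 GB that pdist needs. The runtime is still quadratic in the number of vectors, so this addresses only the memory half of the advice above; reducing the runtime as well would require a different algorithm, such as an index-based or approximate neighbour search.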