MemoryError from sklearn.metrics.silhouette_samples


Problem description

I get a MemoryError when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5.

For the related function, silhouette_score, this post suggests using the sample_size parameter, which reduces the sample size before calling silhouette_samples. I am not sure that the down-sampling would still produce reliable results, so I hesitate to do that.
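For reference, a minimal sketch of that down-sampling approach, assuming hypothetical k-means labels on stand-in data (the n_clusters value and the random data are illustrative, not from the question):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for the real data: 107545 points in 12 dimensions.
X = np.random.rand(107545, 12)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# sample_size scores a random subset instead of all ~1e10 pairwise
# distances; random_state makes the subsample reproducible.
score = silhouette_score(X, labels, metric='euclidean',
                         sample_size=10000, random_state=0)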

My input, X, is a [107545 rows x 12 columns] dataframe, which I would not really consider big, although I only have 8 GB of RAM.

sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])

C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
    167     check_number_of_labels(len(le.classes_), X.shape[0])
    168 
--> 169     distances = pairwise_distances(X, metric=metric, **kwds)
    170     unique_labels = le.classes_
    171     n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1245         func = partial(distance.cdist, metric=metric, **kwds)
   1246 
-> 1247     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1248 
   1249 
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1088     if n_jobs == 1:
   1089         # Special case to avoid picklability checks in delayed
-> 1090         return func(X, Y, **kwds)
   1091 
   1092     # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    244         YY = row_norms(Y, squared=True)[np.newaxis, :]
    245 
--> 246     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    247     distances *= -2
    248     distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 
MemoryError: 

The calculation seems to rely on euclidean_distances, which crashed on the call to np.dot. I am not dealing with sparsity here, so maybe there is no solution. When calculating distances I normally use numpy.linalg.norm(A-B). Does this have better memory handling?

Recommended answer

Update: PR 11135 should resolve this issue within scikit-learn, making the rest of this post obsolete.

You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them, that is, (1e5)**2 = 1e10 distances. Each is a floating-point number; the float64 format takes 8 bytes of memory. So the size of the distance matrix is 8e10 bytes, which is 74.5 GB.
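A quick back-of-the-envelope check of that figure, using the exact sample count from the question rather than the 1e5 round number:

n = 107545                     # samples in the question's dataframe
bytes_needed = n ** 2 * 8      # full float64 pairwise distance matrix
print(bytes_needed / 2 ** 30)  # about 86.2 GiB; with n = 1e5, about 74.5 GiB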

This has occasionally been reported on GitHub (#4701, #4197), with the answer being roughly: it is a NumPy problem that it cannot handle np.dot with matrices of that size. Although there was one comment saying:

"it might be possible to break this up into sub-matrices to do the calculation more memory efficiently."

Indeed, if instead of forming one giant distance matrix at the beginning, the method computed only the relevant chunks of it in the loop over labels, that would require less memory.

It is not hard to modify the method, starting from its source, so that instead of computing the distances first and applying binary masks later, it masks first. This is what I did below. Instead of N**2 memory, where N is the number of samples, it requires n**2, where n is the maximal cluster size.

If this looks practical, I imagine it could be added to scikit-learn by way of some flag. Note, however, that this version does not support metric='precomputed'.

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):

        # Compute the mean intra-cluster distance for all samples
        # belonging to the current label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :], metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
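
As a usage sketch, with df_scaled and df['Cluster_Label'] being the objects from the traceback above, the failing call would become:

sil = silhouette_samples_memory_saving(df_scaled, df['Cluster_Label'])
print(sil.mean())  # mean silhouette coefficient over all samples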

