scikit-learn DBSCAN memory usage


Question

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line and, with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anyhow, read on for a description of my original problem and some interesting discussion.

I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)

My issue at the moment is that I easily run out of memory. (I'm currently working on a machine with 16 GB of RAM)

My question is, is DBSCAN calculating the pairwise distance matrix on the fly as it runs, and that's what's gobbling up my memory? (2.5 million ^ 2) * 8 bytes is obviously stupidly large, I would understand that. Should I not be using the fit() method? And more generally, is there a way around this issue, or am I generally barking up the wrong tree here?
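For a sense of scale, that back-of-envelope figure works out as follows (illustrative arithmetic only, not scikit-learn code):

```python
# Size of a dense float64 pairwise-distance matrix for ~2.5 million samples
n_samples = 2_500_000
bytes_per_float = 8  # float64

matrix_bytes = n_samples ** 2 * bytes_per_float
print(f"{matrix_bytes / 1e12:.0f} TB")  # 50 TB -- far beyond any machine's RAM
```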

Apologies if the answer winds up being obvious. I've been puzzling over this for a few days. Thanks!

Addendum: Also if anyone could explain the difference between fit(X) and fit_predict(X) to me more explicitly I'd also appreciate that--I'm afraid I just don't quite get it.
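For what it's worth, the two methods run the same clustering; fit() stores the result on the estimator while fit_predict() additionally returns the labels. A minimal illustration on toy data (assuming a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Six 1-D points forming two obvious groups
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])

# fit() runs the clustering and stores the result on the estimator...
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels_via_fit = db.labels_

# ...while fit_predict() does the same clustering and also returns labels_.
labels_via_fit_predict = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

print(labels_via_fit)  # [0 0 0 1 1 1]
print((labels_via_fit == labels_via_fit_predict).all())  # True
```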

Addendum #2: To be sure, I just tried this on a machine with ~550 GB of RAM and it still blew up, so I feel like DBSCAN is likely trying to make a pairwise distance matrix or something I clearly don't want it to do. I guess now the big question is how to stop that behavior, or find other methods that might suit my needs more. Thanks for bearing with me here.

Addendum #3(!): I forgot to attach the traceback, here it is,

Traceback (most recent call last):
  File "tDBSCAN.py", line 34, in <module>
    db = DBSCAN(eps=float(sys.argv[2]), min_samples=10, metric='cityblock').fit(mydata)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
    self.fit(X)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
    **self.get_params())
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
    D = pairwise_distances(X, metric=metric)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
    return func(X, Y, **kwds)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
    D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
MemoryError


Answer

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.

DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).
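To see what such an accelerated regionQuery looks like in practice, scikit-learn's own NearestNeighbors exposes radius queries backed by a ball tree or KD-tree; the data and radius below are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 35))

# Build a ball-tree index once; each radius query then avoids the
# full O(n^2) pairwise distance scan.
nn = NearestNeighbors(radius=10.0, metric='cityblock',
                      algorithm='ball_tree').fit(X)

# regionQuery for point 0: indices of all points within the radius
neighbors = nn.radius_neighbors(X[:1], return_distance=False)[0]
print(len(neighbors))
```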

The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two options:


1. You may want to try the DBSCAN implementation in ELKI instead, which, when used with an R*-tree index, is usually substantially faster than a naive implementation.

2. Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.
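A rough sketch of that idea follows: a simplified, unoptimized DBSCAN built on a KD-tree regionQuery (not the reference implementation, and the blob data is invented for the smoke test):

```python
import numpy as np
from scipy.spatial import cKDTree

def dbscan(X, eps, min_samples):
    """Toy DBSCAN: -1 marks noise, 0..k mark clusters.

    The KD-tree plays the role of regionQuery; passing p=1 to
    query_ball_point would give the Manhattan metric from the
    question (Euclidean is used here for brevity).
    """
    tree = cKDTree(X)
    labels = np.full(len(X), -1)
    visited = np.zeros(len(X), dtype=bool)
    cluster = -1
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = tree.query_ball_point(X[i], eps)
        if len(neighbors) < min_samples:
            continue  # not a core point; may still become a border point later
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = tree.query_ball_point(X[j], eps)
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)  # j is a core point: keep expanding
            if labels[j] == -1:
                labels[j] = cluster  # claim unlabeled/noise points as members
    return labels

# Smoke test: two well-separated blobs
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
print(dbscan(X, eps=1.0, min_samples=5))
```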

Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.
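In that spirit, with a recent scikit-learn you can at least make the tree index explicit and keep eps tight; the data and parameter values below are invented for illustration, but algorithm= and metric= are real DBSCAN options:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated 35-dimensional blobs (synthetic stand-in data)
rng = np.random.default_rng(0)
mydata = np.concatenate([rng.normal(0, 0.01, (500, 35)),
                         rng.normal(5, 0.01, (500, 35))])

# A tree-based neighbor search avoids the dense distance matrix, but the
# neighborhoods of all points are still materialized up front, so memory
# still grows with eps; keep eps as small as the data allows.
db = DBSCAN(eps=1.0, min_samples=10, metric='cityblock',
            algorithm='ball_tree').fit(mydata)
print(np.unique(db.labels_))  # [0 1]
```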

