scikit-learn DBSCAN memory usage


Question


UPDATED: In the end, the solution I opted to use for clustering my large dataset was the one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line and, with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anywho, read on for a description of my original problem and some interesting discussion.


I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)


My issue at the moment is that I easily run out of memory. (I'm currently working on a machine with 16 GB of RAM)


My question is, is DBSCAN calculating the pairwise distance matrix on the fly as it runs, and that's what's gobbling up my memory? (2.5 million ^ 2) * 8 bytes is obviously stupidly large, I would understand that. Should I not be using the fit() method? And more generally, is there a way around this issue, or am I generally barking up the wrong tree here?
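For scale, the size of that dense pairwise matrix works out as follows (a quick back-of-envelope check, not an exact accounting of what sklearn allocates):

```python
# Rough size of a dense pairwise distance matrix for n = 2.5 million
# samples stored as float64 (8 bytes per entry).
n = 2_500_000
matrix_bytes = n * n * 8
print(matrix_bytes / 10**12)  # 50.0 -- about 50 TB, hence the MemoryError
```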


Apologies if the answer winds up being obvious. I've been puzzling over this for a few days. Thanks!


Addendum: Also if anyone could explain the difference between fit(X) and fit_predict(X) to me more explicitly I'd also appreciate that--I'm afraid I just don't quite get it.
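In short, fit(X) runs the clustering and stores the result on the estimator, while fit_predict(X) is a convenience wrapper that does the same thing and hands the labels back directly. A small sketch on toy data (the array X here is just a stand-in):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((100, 5))  # toy stand-in data

# fit() clusters X and stores the per-sample labels on the estimator.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels_a = db.labels_

# fit_predict() calls fit() internally and returns labels_ directly.
labels_b = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

assert (labels_a == labels_b).all()
```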


Addendum #2: To be sure, I just tried this on a machine with ~550 GB of RAM and it still blew up, so I feel like DBSCAN is likely trying to make a pairwise distance matrix or something I clearly don't want it to do. I guess now the big question is how to stop that behavior, or find other methods that might suit my needs more. Thanks for bearing with me here.


Addendum #3(!): I forgot to attach the traceback, here it is,

Traceback (most recent call last):
  File "tDBSCAN.py", line 34, in <module>
    db = DBSCAN(eps=float(sys.argv[2]), min_samples=10, metric='cityblock').fit(mydata)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
    self.fit(X)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
    **self.get_params())
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
    D = pairwise_distances(X, metric=metric)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
    return func(X, Y, **kwds)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
    D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
MemoryError

Answer


The problem apparently is a non-standard DBSCAN implementation in scikit-learn.


DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).
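To make the idea concrete, here is what an indexed regionQuery looks like with a k-d tree (using sklearn's KDTree purely as an illustration; ELKI's R*-tree plays the same role):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))

tree = KDTree(X, metric='cityblock')  # built once up front

def region_query(i, eps):
    """DBSCAN's regionQuery: indices of all points within eps of point i."""
    return tree.query_radius(X[i:i + 1], r=eps)[0]

neighbors = region_query(0, eps=0.2)  # each query avoids scanning all n points
```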


The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two options:


  1. You may want to try the DBSCAN implementation in ELKI instead, which, when used with an R*-tree index, usually is substantially faster than a naive implementation.


  2. Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.
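As a sketch of how little is involved, here is a minimal textbook DBSCAN built on such an indexed regionQuery (an illustrative reimplementation using sklearn's KDTree for the index, not sklearn's own code):

```python
import numpy as np
from sklearn.neighbors import KDTree

def dbscan(X, eps, min_samples, metric='cityblock'):
    """Minimal textbook DBSCAN: returns -1 for noise, clusters numbered from 0.

    All the heavy lifting sits in the indexed regionQuery (query_radius),
    so memory stays O(n) plus the neighbor lists actually visited.
    """
    n = len(X)
    tree = KDTree(X, metric=metric)
    labels = np.full(n, -1)        # -1 = noise / unassigned
    visited = np.zeros(n, bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(tree.query_radius(X[i:i + 1], r=eps)[0])
        if len(seeds) < min_samples:
            continue               # noise (may still be claimed as a border point)
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                neigh = tree.query_radius(X[j:j + 1], r=eps)[0]
                if len(neigh) >= min_samples:
                    seeds.extend(neigh)   # j is a core point: expand through it
            if labels[j] == -1:
                labels[j] = cluster       # border point joins the cluster
        cluster += 1
    return labels
```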


Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.
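With a recent sklearn that looks like the following sketch (toy data standing in for the real 2.5M-row array; eps=40 and min_samples=10 are the question's parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((500, 35)) * 100   # toy stand-in for the 2.5M x 35 dataset

# algorithm='kd_tree' makes sklearn answer the radius queries through an
# index instead of a dense distance matrix; metric='cityblock' matches
# the question's setup. With a very large eps every point still
# materializes its full neighbor list, so shrinking eps is the main
# memory lever.
db = DBSCAN(eps=40, min_samples=10, metric='cityblock', algorithm='kd_tree')
labels = db.fit_predict(X)
```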

